HIGH-DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN
APPROXIMATE FACTOR MODELS

By Jianqing Fan, Yuan Liao and Martina Mincheva 

Princeton University 

The variance-covariance matrix plays a central role in the infer- 
ential theories of high-dimensional factor models in finance and eco- 
nomics. Popular regularization methods of directly exploiting spar- 
sity are not directly applicable to many financial problems. Classical 
methods of estimating the covariance matrices are based on the strict 
factor models, assuming independent idiosyncratic components. This 
assumption, however, is restrictive in practical applications. By assuming
a sparse error covariance matrix, we allow the presence of
cross-sectional correlation even after taking out common factors, which
enables us to combine the merits of both methods. We estimate
the sparse covariance using the adaptive thresholding technique as 
in Cai and Liu [J. Amer. Statist. Assoc. 106 (2011) 672-684], tak- 
ing into account the fact that direct observations of the idiosyncratic 
components are unavailable. The impact of high dimensionality on 
the covariance matrix estimation based on the factor structure is 
then studied. 

1. Introduction. We consider a factor model defined as follows:

(1.1) $y_{it} = b_i' f_t + u_{it}$,

where $y_{it}$ is the observed datum for the $i$th ($i = 1, \ldots, p$) asset at time
$t = 1, \ldots, T$; $b_i$ is a $K \times 1$ vector of factor loadings; $f_t$ is a $K \times 1$ vector of
common factors, and $u_{it}$ is the idiosyncratic error component of $y_{it}$. Classical
factor analysis assumes that both p and K are fixed, while T is allowed to 
grow. However, in the recent decades, both economic and financial applica- 
tions have encountered very large data sets which contain high-dimensional 
variables. For example, the World Bank has data for about two hundred 
countries over forty years; in portfolio allocation, the number of stocks can
be in the thousands and be larger than or of the same order as the sample size. In
modeling housing prices in each zip code, the number of regions can be of
order thousands, yet the sample size can be 240 months or twenty years. The
covariance matrix of order several thousands is critical for understanding the
co-movement of housing price indices over these zip codes.

Inferential theory of factor analysis relies on estimating $\Sigma_u$, the
variance-covariance matrix of the error term, and $\Sigma$, the variance-covariance matrix
of $y_t = (y_{1t}, \ldots, y_{pt})'$. In the literature, $\Sigma = \mathrm{cov}(y_t)$ was traditionally
estimated by the sample covariance matrix of $y_t$,

$\Sigma_{sam} = \frac{1}{T-1} \sum_{t=1}^{T} (y_t - \bar{y})(y_t - \bar{y})'$,

which was always assumed to be pointwise root-T consistent. However, the 
sample covariance matrix is an inappropriate estimator in high-dimensional 
settings. For example, when $p$ is larger than $T$, $\Sigma_{sam}$ becomes singular
while $\Sigma$ is always strictly positive definite. Even if $p < T$, Fan, Fan and
Lv (2008) showed that this estimator has a very slow convergence rate under
the Frobenius norm. Realizing the limitation of the sample covariance
estimator in high-dimensional factor models, Fan, Fan and Lv (2008) considered
more refined estimation of $\Sigma$, by incorporating the common factor
structure. One of the key assumptions they made was the cross-sectional
independence among the idiosyncratic components, which results in a diagonal
matrix $\Sigma_u = E u_t u_t'$. The cross-sectional independence, however, is restrictive
in many applications, as it rules out the approximate factor structure
as in Chamberlain and Rothschild (1983). In this paper, we relax this assumption,
and investigate the impact of the cross-sectional correlations of
idiosyncratic noises on the estimation of $\Sigma$ and $\Sigma_u$ when both $p$ and $T$
are allowed to diverge. We show that the estimated covariance matrices are
still invertible with probability approaching one, even if $p > T$. In particular,
when estimating $\Sigma^{-1}$ and $\Sigma_u^{-1}$, we allow $p$ to increase much faster than $T$,
say, $p = O(\exp(T^{\alpha}))$, for some $\alpha \in (0, 1)$.

Sparsity is one of the commonly used assumptions in the estimation of 
high-dimensional covariance matrices, which assumes that many entries of 
the off-diagonal elements are zero, and the number of nonzero off-diagonal 
entries is restricted to grow slowly. Imposing the sparsity assumption directly 
on the covariance of $y_t$, however, is inappropriate for many applications of
finance and economics. In this paper we use the factor model and assume
that $\Sigma_u$ is sparse, and estimate both $\Sigma_u$ and $\Sigma_u^{-1}$ using the thresholding
method [Bickel and Levina (2008a), Cai and Liu (2011)] based on the estimated
residuals in the factor model. It is assumed that the factors $f_t$ are
observable, as in Fama and French (1992), Fan, Fan and Lv (2008), and
many other empirical applications. We derive the convergence rates of both
the estimated $\Sigma$ and its inverse under various norms which are to
be defined later. In addition, we achieve better convergence rates than those
in Fan, Fan and Lv (2008).

Various approaches have been proposed to estimate a large covariance 
matrix: Bickel and Levina (2008a, 2008b) constructed the estimators based 
on regularization and thresholding, respectively. Rothman, Levina and Zhu 
(2009) considered thresholding the sample covariance matrix with more general
thresholding functions. Lam and Fan (2009) proposed a penalized
quasi-likelihood method to achieve both the consistency and sparsistency of the
estimation. More recently, Cai and Zhou (2010) derived the minimax rate
for sparse matrix estimation, and showed that the thresholding estimator 
attains this optimal rate under the operator norm. Cai and Liu (2011) pro- 
posed a thresholding procedure which is adaptive to the variability of indi- 
vidual entries and unveiled its improved rate of convergence. 

The rest of the paper is organized as follows. Section 2 provides the asymptotic
theory for estimating the error covariance matrix and its inverse. Section 3
considers estimating the covariance matrix of $y_t$. Section 4 extends the
results to the seemingly unrelated regression model, a set of linear equations
with correlated error terms in which the covariates are different across equations.
Section 5 reports the simulation results. Finally, Section 6 concludes
with discussions. All proofs are given in the Appendix. Throughout the paper,
we use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the minimum and maximum
eigenvalues of a matrix $A$. We also denote by $\|A\|_F$, $\|A\|$ and $\|A\|_{\max}$ the
Frobenius norm, operator norm and elementwise norm of a matrix $A$, defined,
respectively, as $\|A\|_F = \mathrm{tr}^{1/2}(A'A)$, $\|A\| = \lambda_{\max}^{1/2}(A'A)$ and
$\|A\|_{\max} = \max_{i,j} |A_{ij}|$. Note that, when $A$ is a vector, both $\|A\|$ and $\|A\|_F$
are equal to the Euclidean norm.
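
As a small illustration of the three norms just defined, the following sketch computes them directly from their definitions with NumPy. The function names and the test matrix are illustrative choices, not part of the paper.

```python
import numpy as np

def frobenius_norm(A):
    # ||A||_F = tr^{1/2}(A'A)
    return np.sqrt(np.trace(A.T @ A))

def operator_norm(A):
    # ||A|| = lambda_max^{1/2}(A'A); A'A is symmetric, so eigvalsh applies
    return np.sqrt(np.linalg.eigvalsh(A.T @ A).max())

def max_norm(A):
    # ||A||_max = max_{i,j} |A_ij|
    return np.abs(A).max()

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
v = np.array([[3.0], [4.0]])  # a vector, stored as a 2 x 1 matrix

print(frobenius_norm(A), operator_norm(A), max_norm(A))
print(frobenius_norm(v), operator_norm(v))  # both equal the Euclidean norm of v
```

For the vector `v`, the Frobenius and operator norms coincide, matching the remark above.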

2. Estimation of error covariance matrix. 

2.1. Adaptive thresholding. Consider the following approximate factor
model, in which the cross-sectional correlation among the idiosyncratic error
components is allowed:

(2.1) $y_{it} = b_i' f_t + u_{it}$,

where $i = 1, \ldots, p$ and $t = 1, \ldots, T$; $b_i$ is a $K \times 1$ vector of factor loadings; $f_t$
is a $K \times 1$ vector of observable common factors, uncorrelated with $u_{it}$. Write

$B = (b_1, \ldots, b_p)'$, $y_t = (y_{1t}, \ldots, y_{pt})'$, $u_t = (u_{1t}, \ldots, u_{pt})'$;

then model (2.1) can be written in a more compact form,

(2.2) $y_t = B f_t + u_t$

with $E(u_t | f_t) = 0$.

In practical applications, $p$ can be thought of as the number of assets
or stocks, or number of regions in spatial and temporal problems such as
home price indices or sales of drugs, and in practice can be of the same
order as, or even larger than $T$. For example, an asset pricing model may
contain hundreds of assets while the sample size on daily returns is less than
several hundreds. In the estimation of the optimal portfolio allocation, it
was observed by Fan, Fan and Lv (2008) that the effect of large $p$ on the
convergence rate can be quite severe. In contrast, the number of common
factors, $K$, can be much smaller. For example, the rank theory of consumer
demand systems implies no more than three factors [e.g., Gorman (1981)
and Lewbel (1991)].

The error covariance matrix

$\Sigma_u = \mathrm{cov}(u_t)$

itself is of interest for the inferential theory of factor models. For example, the
asymptotic covariance of the least squares estimator of $B$ depends on it,
and in simulating home price indices over a certain time horizon for
mortgage-based securities, a good estimate of $\Sigma_u$ is needed. When $p$ is close to or
larger than $T$, estimating $\Sigma_u$ is very challenging. Therefore, following the
literature of high-dimensional covariance matrix estimation, we assume it
is sparse, that is, many of its off-diagonal entries are zeros. Specifically, let
$\Sigma_u = (\sigma_{ij})_{p \times p}$. Define

(2.3) $m_T = \max_{i \le p} \sum_{j \le p} I(\sigma_{ij} \ne 0)$.

The sparsity assumption puts an upper bound restriction on $m_T$. Specifically,
we assume

(2.4) $m_T^2 = o\left(\frac{T}{K^2 \log p}\right)$.

In this formulation, we even allow the number of factors $K$ to be large,
possibly growing with $T$.

A more general treatment [e.g., Bickel and Levina (2008a) and Cai and Liu
(2011)] is to assume that the $l_q$ norms of the row vectors of $\Sigma_u$ are uniformly
bounded across rows by a slowly growing sequence, for some $q \in [0, 1)$. In contrast,
the assumption we make in this paper, that is, $q = 0$, has a clearer economic
interpretation. For example, the firm returns can be modeled by the
factor model, where $u_{it}$ represents a firm's individual shock at time $t$. Driven
by the industry-specific components, these shocks are correlated among the
firms in the same industry, but can be assumed to be uncorrelated across
industries, since the industry-specific components are not pervasive for the
whole economy [Connor and Korajczyk (1993)].

We estimate $\Sigma_u$ using the thresholding technique introduced and studied
by Bickel and Levina (2008a), Rothman, Levina and Zhu (2009), Cai and
Liu (2011), etc., which is summarized as follows: Suppose we observe data
$(X_1, \ldots, X_T)$ of a $p \times 1$ vector $X$, which follows a multivariate Gaussian
distribution $N(0, \Sigma_X)$. The sample covariance matrix of $X$ is thus given by

$S_X = \frac{1}{T} \sum_{i=1}^{T} (X_i - \bar{X})(X_i - \bar{X})' = (s_{ij})_{p \times p}$.

Define the thresholding operator by $\mathcal{T}_t(M) = (M_{ij} I(|M_{ij}| \ge t))$ for any symmetric
matrix $M$. Then $\mathcal{T}_t$ preserves the symmetry of $M$. Let $\hat\Sigma_X^{\mathcal{T}} = \mathcal{T}_{\omega_T}(S_X)$,
where $\omega_T = O(\sqrt{\log p / T})$. Bickel and Levina (2008a) then showed that

$\|\hat\Sigma_X^{\mathcal{T}} - \Sigma_X\| = O_p(\omega_T m_T)$.

In the factor models, however, we do not observe the error term directly.
Hence when estimating the error covariance matrix of a factor model, we
need to construct a sample covariance matrix based on the residuals $\hat u_{it}$
before thresholding. The residuals are obtained using the plug-in method,
by estimating the factor loadings first. Let $\hat b_i$ be the ordinary least squares
(OLS) estimator of $b_i$, and

$\hat u_{it} = y_{it} - \hat b_i' f_t$.

Denote by $\hat u_t = (\hat u_{1t}, \ldots, \hat u_{pt})'$. We then construct the residual covariance
matrix as

$\hat\Sigma_u = \frac{1}{T} \sum_{t=1}^{T} \hat u_t \hat u_t' = (\hat\sigma_{ij})$.

Note that the thresholding value $\omega_T = O(\sqrt{\log p / T})$ in Bickel and Levina
(2008a) is in fact obtained from the rate of convergence of $\max_{i,j} |s_{ij} - \Sigma_{X,ij}|$.
This rate changes when $s_{ij}$ is replaced with the residual-based $\hat\sigma_{ij}$, and it will be
slower if the number of common factors $K$ increases with $T$. Therefore, the
thresholding value $\omega_T$ used in this paper is adjusted to account for the effect
of the estimation of the residuals.
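
The plug-in construction described above can be sketched as follows: regress each series on the observed factors by OLS, form the residual covariance matrix with a 1/T normalization, and apply a universal hard threshold. This is a minimal sketch under stated assumptions: the function names, the simulated data, the 1/T normalization and the constant 1 in front of $\sqrt{\log p / T}$ are illustrative choices rather than the paper's prescription.

```python
import numpy as np

def ols_residuals(Y, F):
    """Y is p x T (series in rows), F is K x T (factors in rows).
    Returns the p x K loading estimate and the p x T residual matrix."""
    # Row i of B_hat is the OLS coefficient of y_i on f_t (no intercept)
    B_hat = np.linalg.solve(F @ F.T, F @ Y.T).T
    U_hat = Y - B_hat @ F
    return B_hat, U_hat

def residual_covariance(U_hat):
    # (1/T) * sum_t u_hat_t u_hat_t'
    T = U_hat.shape[1]
    return U_hat @ U_hat.T / T

def hard_threshold(M, t):
    # Keep entry M_ij only when |M_ij| >= t; symmetry of M is preserved
    return M * (np.abs(M) >= t)

rng = np.random.default_rng(0)
p, K, T = 5, 2, 400
B = rng.normal(size=(p, K))
F = rng.normal(size=(K, T))
Y = B @ F + rng.normal(size=(p, T))

B_hat, U_hat = ols_residuals(Y, F)
Sigma_u_hat = residual_covariance(U_hat)
omega_T = np.sqrt(np.log(p) / T)  # illustrative constant of 1
Sigma_u_thr = hard_threshold(Sigma_u_hat, omega_T)
```

By construction the OLS residuals are orthogonal to the factors in sample, which is what makes the residual covariance a natural input for thresholding.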

2.2. Asymptotic properties of the thresholding estimator. Bickel and Lev- 
ina (2008a) used a universal constant as the thresholding value. As pointed 
out by Rothman, Levina and Zhu (2009) and Cai and Liu (2011), when the 
variances of the entries of the sample covariance matrix vary over a wide 
range, it is more desirable to use thresholds that capture the variability of 
individual estimation. For this purpose, in this paper, we apply the adaptive 
thresholding estimator [Cai and Liu (2011)] to estimate the error covariance 
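matrix, which is given by

(2.5) $\hat\Sigma_u^{\mathcal{T}} = (\hat\sigma_{ij}^{\mathcal{T}})$, $\hat\sigma_{ij}^{\mathcal{T}} = \hat\sigma_{ij} I(|\hat\sigma_{ij}| \ge \sqrt{\hat\theta_{ij}}\, \omega_T)$,
$\hat\theta_{ij} = \frac{1}{T} \sum_{t=1}^{T} (\hat u_{it} \hat u_{jt} - \hat\sigma_{ij})^2$,

for some $\omega_T$ to be specified later.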
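
A minimal sketch of entry-adaptive thresholding in the style of Cai and Liu (2011) follows. It assumes the rule "keep $\hat\sigma_{ij}$ when $|\hat\sigma_{ij}| \ge \sqrt{\hat\theta_{ij}}\,\omega_T$", where $\hat\theta_{ij}$ is the sample variance of the products $\hat u_{it}\hat u_{jt}$. The function name, the simulated stand-in residuals and the constant in `omega_T` are hypothetical choices for illustration only.

```python
import numpy as np

def adaptive_threshold(U_hat, omega_T):
    """U_hat is a p x T matrix of (estimated) residuals.
    Returns the entry-adaptively thresholded covariance estimate."""
    p, T = U_hat.shape
    sigma_hat = U_hat @ U_hat.T / T
    # prod[i, j, t] = u_it * u_jt, so theta_hat[i, j] is the average
    # squared deviation of these products from sigma_hat[i, j]
    prod = U_hat[:, None, :] * U_hat[None, :, :]
    theta_hat = ((prod - sigma_hat[:, :, None]) ** 2).mean(axis=2)
    keep = np.abs(sigma_hat) >= np.sqrt(theta_hat) * omega_T
    return sigma_hat * keep

rng = np.random.default_rng(0)
p, T = 6, 500
U_hat = rng.normal(size=(p, T))          # stand-in for OLS residuals
omega_T = 2.0 * np.sqrt(np.log(p) / T)   # the constant 2.0 is an assumption
Sigma_u_adaptive = adaptive_threshold(U_hat, omega_T)
```

The `p x p x T` array of products is convenient for small `p`; for large `p` one would compute $\hat\theta_{ij}$ blockwise to limit memory.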
We impose the following assumptions:

Assumption 2.1. (i) $\{u_t\}_{t \ge 1}$ is stationary and ergodic such that each $u_t$
has zero mean vector and covariance matrix $\Sigma_u$. In addition, the strong
mixing condition in Assumption 3.2 holds.

(ii) There exist constants $c_1, c_2 > 0$ such that $c_1 < \lambda_{\min}(\Sigma_u) \le \lambda_{\max}(\Sigma_u) < c_2$,
and $c_1 < \mathrm{var}(u_{it} u_{jt}) < c_2$ for all $i \le p$, $j \le p$.

(iii) There exist $r_1 > 0$ and $b_1 > 0$, such that for any $s > 0$ and $i \le p$,

(2.6) $P(|u_{it}| > s) \le \exp(-(s/b_1)^{r_1})$.

Condition (i) allows the idiosyncratic components to be weakly dependent.
We will formally present the strong mixing condition in the next section.
In order for the main results in this section to hold, it suffices to impose
the strong mixing condition marginally on $u_t$ only. Roughly speaking, we
require the mixing coefficient

$\alpha(T) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{T}^{\infty}} |P(A)P(B) - P(A \cap B)|$

to decrease exponentially fast as $T \to \infty$, where $(\mathcal{F}_{-\infty}^{0}, \mathcal{F}_{T}^{\infty})$ are the
$\sigma$-algebras generated by $\{u_t\}_{t=-\infty}^{0}$ and $\{u_t\}_{t=T}^{\infty}$, respectively.

Condition (ii) requires the nonsingularity of $\Sigma_u$. Note that Cai and Liu
(2011) allowed $\max_j \sigma_{jj}$ to diverge when direct observations are available.
Condition (ii), however, requires that $\sigma_{jj}$ should be uniformly bounded. In
factor models, a uniform upper bound on the variance of $u_{it}$ is needed when
we estimate the covariance matrix of $y_t$ later. This assumption is satisfied
by most of the applications of factor models. Condition (iii) requires the
distributions of $(u_{1t}, \ldots, u_{pt})$ to have exponential-type tails, which allows us
to apply the large deviation theory to $\frac{1}{T} \sum_{t=1}^{T} u_{it} u_{jt} - \sigma_{ij}$.

Assumption 2.2. There exist positive sequences $\kappa_1(p,T) = o(1)$,
$\kappa_2(p,T) = o(1)$ and $a_T = o(1)$, and a constant $M > 0$, such that for all $C > M$,

$P\left( \max_{i \le p} \frac{1}{T} \sum_{t=1}^{T} |u_{it} - \hat u_{it}|^2 > C a_T^2 \right) \le O(\kappa_1(p,T))$,

$P\left( \max_{i \le p,\, t \le T} |u_{it} - \hat u_{it}| > C \right) \le O(\kappa_2(p,T))$.

This assumption allows us to apply thresholding to the estimated error
covariance matrix when direct observations are not available, without introducing
too much extra estimation error. Note that it permits a general case
when the original "data" is contaminated, including any type of estimate of
the data when direct observations are not available, as well as the case when
data is subject to measurement errors. We will show in the next section
that in a linear factor model when $\{u_{it}\}_{i \le p,\, t \le T}$ are estimated using the OLS
estimator, the rate of convergence is $a_T^2 = (K^2 \log p)/T$.

The following theorem establishes the asymptotic properties of the thresholding
estimator $\hat\Sigma_u^{\mathcal{T}}$, based on observations with estimation errors. Let
$\gamma^{-1} = 3 r_1^{-1} + r_2^{-1}$, where $r_1$ and $r_2$ are defined in Assumptions 2.1 and 3.2,
respectively.

Theorem 2.1. Suppose $\gamma < 1$ and $(\log p)^{6/\gamma - 1} = o(T)$. Then under
Assumptions 2.1 and 2.2, there exist $C_1 > 0$ and $C_2 > 0$ such that for $\hat\Sigma_u^{\mathcal{T}}$
defined in (2.5) with

$\omega_T = C_1 \left( \sqrt{\frac{\log p}{T}} + a_T \right)$,

we have

(2.7) $P(\|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\| \le C_2 \omega_T m_T) \ge 1 - O\left( \frac{1}{p^2} + \kappa_1(p,T) + \kappa_2(p,T) \right)$.

In addition, if $\omega_T m_T = o(1)$, then with probability at least
$1 - O(\frac{1}{p^2} + \kappa_1(p,T) + \kappa_2(p,T))$,

$\lambda_{\min}(\hat\Sigma_u^{\mathcal{T}}) \ge 0.5 \lambda_{\min}(\Sigma_u)$

and

$\|(\hat\Sigma_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le C_2 \omega_T m_T$.

Note that we derive result (2.7) without assuming the sparsity on $\Sigma_u$, that
is, no restriction is imposed on $m_T$. When $\omega_T m_T \ne o(1)$, (2.7) still holds,
but $\|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\|$ does not converge to zero in probability. On the other hand,
the condition $\omega_T m_T = o(1)$ is required to preserve the nonsingularity of $\hat\Sigma_u^{\mathcal{T}}$
asymptotically and to consistently estimate $\Sigma_u^{-1}$.

The rate of convergence also depends on the averaged estimation error of
the residual terms. We will see in the next section that when the number of
common factors $K$ increases slowly, the convergence rate in Theorem 2.1 is
close to the minimax optimal rate as in Cai and Zhou (2010).



3. Estimation of covariance matrix using factors. We now investigate
the estimation of the covariance matrix $\Sigma$ in the approximate factor model

$y_t = B f_t + u_t$,

where $\Sigma = \mathrm{cov}(y_t)$. This covariance matrix is particularly of interest in many
applications of factor models as well as corresponding inferential theories.
When estimating a large dimensional covariance matrix, sparsity and banding
are two commonly used assumptions for regularization [e.g., Bickel and
Levina (2008a, 2008b)]. In most of the applications in finance and economics,
however, these two assumptions are inappropriate for $\Sigma$. For instance, the
US housing prices at the county level are generally associated with a few national
indices, and there is no natural ordering among the counties. Hence
neither the sparsity nor the banding is realistic for such a problem. On the
other hand, it is natural to assume that $\Sigma_u$ is sparse, after controlling for the common
factors. Therefore, our approach combines the merits of both the sparsity
and factor structures.
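Note that

$\Sigma = B \,\mathrm{cov}(f_t)\, B' + \Sigma_u$.

By the Sherman-Morrison-Woodbury formula,

$\Sigma^{-1} = \Sigma_u^{-1} - \Sigma_u^{-1} B [\mathrm{cov}(f_t)^{-1} + B' \Sigma_u^{-1} B]^{-1} B' \Sigma_u^{-1}$.

When the factors are observable, one can estimate $B$ by the least squares
method, $\hat B = (\hat b_1, \ldots, \hat b_p)'$, where

$\hat b_i = \arg\min_{b_i} \frac{1}{Tp} \sum_{t=1}^{T} \sum_{i=1}^{p} (y_{it} - b_i' f_t)^2$.

The covariance matrix $\mathrm{cov}(f_t)$ can be estimated by the sample covariance
matrix

$\widehat{\mathrm{cov}}(f_t) = T^{-1} X X' - T^{-2} X 1 1' X'$,

where $X = (f_1, \ldots, f_T)$, and $1$ is a $T$-dimensional column vector of ones.
Therefore, by employing the thresholding estimator in (2.5), we obtain
substitution estimators

(3.1) $\hat\Sigma^{\mathcal{T}} = \hat B \,\widehat{\mathrm{cov}}(f_t)\, \hat B' + \hat\Sigma_u^{\mathcal{T}}$

and

(3.2) $(\hat\Sigma^{\mathcal{T}})^{-1} = (\hat\Sigma_u^{\mathcal{T}})^{-1} - (\hat\Sigma_u^{\mathcal{T}})^{-1} \hat B [\widehat{\mathrm{cov}}(f_t)^{-1} + \hat B' (\hat\Sigma_u^{\mathcal{T}})^{-1} \hat B]^{-1} \hat B' (\hat\Sigma_u^{\mathcal{T}})^{-1}$.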
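
The two substitution formulas can be checked numerically. The sketch below builds the low-rank-plus-sparse estimate and its Sherman-Morrison-Woodbury inverse, and compares the latter with a direct matrix inverse. All names and the simulated inputs are illustrative assumptions; in particular `Sigma_u_thr` is a hand-made sparse positive definite matrix standing in for a thresholded error covariance estimate.

```python
import numpy as np

def factor_covariance(F):
    # T^{-1} X X' - T^{-2} X 1 1' X', with X = (f_1, ..., f_T) stored as K x T
    K, T = F.shape
    ones = np.ones((T, 1))
    return F @ F.T / T - (F @ ones) @ (ones.T @ F.T) / T**2

def substitution_estimator(B_hat, cov_f, Sigma_u_thr):
    # Low-rank factor part plus (thresholded) error covariance
    return B_hat @ cov_f @ B_hat.T + Sigma_u_thr

def woodbury_inverse(B_hat, cov_f, Sigma_u_thr):
    # Only a p x p sparse matrix and a K x K matrix are inverted
    Su_inv = np.linalg.inv(Sigma_u_thr)
    middle = np.linalg.inv(np.linalg.inv(cov_f) + B_hat.T @ Su_inv @ B_hat)
    return Su_inv - Su_inv @ B_hat @ middle @ B_hat.T @ Su_inv

rng = np.random.default_rng(1)
p, K, T = 8, 3, 200
B_hat = rng.normal(size=(p, K))
F = rng.normal(size=(K, T))
Sigma_u_thr = np.diag(rng.uniform(0.5, 1.5, size=p))
Sigma_u_thr[0, 1] = Sigma_u_thr[1, 0] = 0.2  # one nonzero off-diagonal pair

cov_f = factor_covariance(F)
Sigma_hat = substitution_estimator(B_hat, cov_f, Sigma_u_thr)
Sigma_hat_inv = woodbury_inverse(B_hat, cov_f, Sigma_u_thr)
```

The practical appeal of the second formula is that, when the error covariance estimate is sparse and $K$ is small, the inverse never requires factorizing a dense $p \times p$ matrix.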

In practice, one may apply a common thresholding $\lambda$ to the correlation
matrix of $\hat\Sigma_u$, and then use the substitution estimator similar to (3.1). When
$\lambda = 0$ (no thresholding), the resulting estimator is the sample covariance,
whereas when $\lambda = 1$ (all off-diagonals are thresholded), the resulting estimator
is an estimator based on the strict factor model [Fan, Fan and Lv
(2008)]. Thus we have created a path (indexed by $\lambda$) which connects the
nonparametric estimate of covariance matrix to the parametric estimate.
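
The path indexed by $\lambda$ can be illustrated with a short sketch. It assumes demeaned data, a 1/T normalization and OLS without intercept, under which the $\lambda = 0$ endpoint reproduces the sample covariance exactly; the strict inequality in the thresholding rule and all names are illustrative choices.

```python
import numpy as np

def threshold_correlation(Sigma_u_hat, lam):
    """Zero out off-diagonal entries whose absolute correlation is at
    most lam; the diagonal (variances) is always kept."""
    d = np.sqrt(np.diag(Sigma_u_hat))
    R = Sigma_u_hat / np.outer(d, d)
    keep = np.abs(R) > lam
    np.fill_diagonal(keep, True)
    return Sigma_u_hat * keep

rng = np.random.default_rng(2)
p, K, T = 6, 2, 300
F = rng.normal(size=(K, T))
F -= F.mean(axis=1, keepdims=True)            # demeaned factors
Y = rng.normal(size=(p, K)) @ F + rng.normal(size=(p, T))
Y -= Y.mean(axis=1, keepdims=True)            # demeaned observations

B_hat = np.linalg.solve(F @ F.T, F @ Y.T).T   # OLS loadings
U_hat = Y - B_hat @ F
Sigma_u_hat = U_hat @ U_hat.T / T
low_rank = B_hat @ (F @ F.T / T) @ B_hat.T

def path_estimator(lam):
    # Substitution estimator with a common correlation threshold lam
    return low_rank + threshold_correlation(Sigma_u_hat, lam)

sample_cov = Y @ Y.T / T
strict_factor = path_estimator(1.0)           # diagonal error covariance
```

Intermediate values of `lam` interpolate between these two endpoints, trading the flexibility of the sample covariance against the parsimony of the strict factor model.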

The following assumptions are made:

Assumption 3.1. (i) $\{f_t\}_{t \ge 1}$ is stationary and ergodic.
(ii) $\{u_t\}_{t \ge 1}$ and $\{f_t\}_{t \ge 1}$ are independent.

In addition to the conditions above, we introduce the strong mixing conditions
to conduct asymptotic analysis of the least squares estimates. Let $\mathcal{F}_{-\infty}^{0}$
and $\mathcal{F}_{T}^{\infty}$ denote the $\sigma$-algebras generated by $\{(f_t, u_t) : -\infty \le t \le 0\}$ and
$\{(f_t, u_t) : T \le t \le \infty\}$, respectively. In addition, define the mixing coefficient

$\alpha(T) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{T}^{\infty}} |P(A)P(B) - P(AB)|$.

The following strong mixing assumption enables us to apply Bernstein's
inequality in the technical proofs.

Assumption 3.2. There exist positive constants $r_2$ and $C$ such that for
all $t \in \mathbb{Z}^{+}$,

$\alpha(t) \le \exp(-C t^{r_2})$.

In addition, we impose the following regularity conditions:

Assumption 3.3. (i) There exists a constant $M > 0$ such that for all $i$, $j$
and $t$, $E y_{it}^2 < M$, $E f_{it}^2 < M$ and $|b_{ij}| < M$.

(ii) There exists a constant $r_3 > 0$ with $3 r_3^{-1} + r_2^{-1} > 1$, and $b_2 > 0$ such
that for any $s > 0$ and $i \le K$,

(3.3) $P(|f_{it}| > s) \le \exp(-(s/b_2)^{r_3})$.

Condition (ii) allows us to apply the Bernstein-type inequality for the
weakly dependent data.

Assumption 3.4. There exists a constant $C > 0$ such that

$\lambda_{\min}(\mathrm{cov}(f_t)) \ge C$.

Assumptions 3.4 and 2.1 ensure that both $\lambda_{\min}(\mathrm{cov}(f_t))$ and $\lambda_{\min}(\Sigma)$ are
bounded away from zero, which is needed to derive the convergence rate of
$\|(\hat\Sigma^{\mathcal{T}})^{-1} - \Sigma^{-1}\|$ below.

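The following lemma verifies Assumption 2.2, which derives the rate of
convergence of the OLS estimator as well as the estimated residuals.
Let $\gamma_2^{-1} = 1.5 r_1^{-1} + 1.5 r_3^{-1} + r_2^{-1}$.

Lemma 3.1. Suppose $K = o(p)$, $K^4 (\log p)^2 = o(T)$ and $(\log p)^{2/\gamma_2 - 1} = o(T)$.
Then under the assumptions of Theorem 2.1 and Assumptions 3.1-3.4,
there exists $C > 0$, such that:

(i) $P\left( \max_{i \le p} \|\hat b_i - b_i\| > C \sqrt{\frac{K \log p}{T}} \right) = O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$;

(ii) $P\left( \max_{i \le p} \frac{1}{T} \sum_{t=1}^{T} |u_{it} - \hat u_{it}|^2 > \frac{C K^2 \log p}{T} \right) = O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$;

(iii) $P\left( \max_{i \le p,\, t \le T} |u_{it} - \hat u_{it}| > C K (\log T)^{1/r_3} \sqrt{\frac{\log p}{T}} \right) = O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$.

By Lemma 3.1 and Assumption 2.2, $a_T = K \sqrt{(\log p)/T}$ and $\kappa_1(p,T) =
\kappa_2(p,T) = p^{-2} + T^{-2}$. Therefore in the linear approximate factor model,
the thresholding parameter $\omega_T$ defined in Theorem 2.1 is simplified to the
following: for some positive constant $C_1'$,

(3.4) $\omega_T = C_1' K \sqrt{\frac{\log p}{T}}$.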

Now we can apply Theorem 2.1 to obtain the following theorem:

Theorem 3.1. Under the assumptions of Lemma 3.1, there exist $C_1' > 0$
and $C_2' > 0$ such that the adaptive thresholding estimator defined in (2.5)
with $\omega_T^2 = C_1' \frac{K^2 \log p}{T}$ satisfies:

(i) $P\left( \|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\| \le C_2' m_T K \sqrt{\frac{\log p}{T}} \right) \ge 1 - O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$.

(ii) If $m_T K \sqrt{\frac{\log p}{T}} = o(1)$, then with probability at least $1 - O(\frac{1}{p^2} + \frac{1}{T^2})$,

$\lambda_{\min}(\hat\Sigma_u^{\mathcal{T}}) \ge 0.5 \lambda_{\min}(\Sigma_u)$

and

$\|(\hat\Sigma_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le C_2' m_T K \sqrt{\frac{\log p}{T}}$.

Remark 3.1. We briefly comment on the terms in the convergence rate
above.

(1) The term $K$ appears as an effect of using the estimated residuals
to construct the thresholding covariance estimator, which is typically small
compared to $p$ and $T$ in many applications. For instance, the famous
Fama-French three-factor model shows that $K = 3$ factors are adequate for the US
equity market. In an empirical study on asset returns, Bai and Ng (2002)
used the monthly data which contains the returns of 4883 stocks for sixty
months. For their data set, $T = 60$, $p = 4883$. Bai and Ng (2002) determined
$K = 2$ common factors.

(2) As in Bickel and Levina (2008a) and Cai and Liu (2011), $m_T$, the
maximum number of nonzero components across the rows of $\Sigma_u$, also plays
a role in the convergence rate. Note that when $K$ is bounded, the convergence
rate reduces to $O_p(m_T \sqrt{(\log p)/T})$, the same as the minimax rate derived
by Cai and Zhou (2010).

One of our main objectives is to estimate $\Sigma$, which is the $p \times p$ dimensional
covariance matrix of $y_t$, assumed to be time invariant. We can achieve
a better accuracy in estimating both $\Sigma$ and $\Sigma^{-1}$ by incorporating the factor
structure than using the sample covariance matrix, as shown by Fan,
Fan and Lv (2008) in the strict factor model case. When the cross-sectional
correlations among the idiosyncratic components $(u_{1t}, \ldots, u_{pt})$ are present,
we can still take advantage of the factor structure. This is particularly
essential when a direct sparsity assumption on $\Sigma$ is inappropriate.

Assumption 3.5. $\|p^{-1} B'B - \Omega\| = o(1)$ for some $K \times K$ symmetric
positive definite matrix $\Omega$ such that $\lambda_{\min}(\Omega)$ is bounded away from zero.

Assumption 3.5 requires that the factors should be pervasive, that is,
impact every individual time series [Harding (2009)]. It was imposed by
Fan, Fan and Lv (2008) only when they tried to establish the asymptotic
normality of the covariance estimator. However, it turns out to be also helpful
to obtain a good upper bound of $\|(\hat\Sigma^{\mathcal{T}})^{-1} - \Sigma^{-1}\|$, as it ensures that
$\lambda_{\max}((B' \Sigma_u^{-1} B)^{-1}) = O(p^{-1})$.

Fan, Fan and Lv (2008) obtained an upper bound of $\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_F$ under the
Frobenius norm when $\Sigma_u$ is diagonal, that is, there was no cross-sectional
correlation among the idiosyncratic errors. In order for their upper bound to
decrease to zero, $p^2 < T$ is required. Even with this restrictive assumption,
they showed that the convergence rate is the same as the usual sample
covariance matrix of $y_t$, though the latter does not take the factor structure
into account. Alternatively, they considered the entropy loss norm, proposed
by James and Stein (1961),

$\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\Sigma} = (p^{-1} \mathrm{tr}[(\hat\Sigma^{\mathcal{T}} \Sigma^{-1} - I)^2])^{1/2} = p^{-1/2} \|\Sigma^{-1/2} (\hat\Sigma^{\mathcal{T}} - \Sigma) \Sigma^{-1/2}\|_F$.

Here the factor $p^{-1/2}$ is used for normalization, such that $\|\Sigma\|_{\Sigma} = 1$. Under
this norm, Fan, Fan and Lv (2008) showed that the substitution estimator
has a better convergence rate than the usual sample covariance matrix.
Note that the normalization factor $p^{-1/2}$ in the definition results in an averaged
estimation error, which also cancels out the diverging dimensionality
introduced by $p$. In addition, for any two $p \times p$ matrices $A_1$ and $A_2$,

$\|A_1 - A_2\|_{\Sigma} = p^{-1/2} \|\Sigma^{-1/2} (A_1 - A_2) \Sigma^{-1/2}\|_F$

$\le \|\Sigma^{-1/2} (A_1 - A_2) \Sigma^{-1/2}\|$

$\le \|A_1 - A_2\| \cdot \lambda_{\max}(\Sigma^{-1})$.
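
The entropy loss norm and the chain of inequalities above can be verified numerically on a small example. This is a sketch only: the helper names, the randomly generated positive definite matrix and the perturbation are illustrative assumptions.

```python
import numpy as np

def inv_sqrt(Sigma):
    # Symmetric inverse square root via the eigendecomposition of Sigma
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(w ** -0.5) @ V.T

def entropy_loss_norm(A, Sigma):
    # ||A||_Sigma = p^{-1/2} ||Sigma^{-1/2} A Sigma^{-1/2}||_F
    p = Sigma.shape[0]
    S = inv_sqrt(Sigma)
    return np.linalg.norm(S @ A @ S, "fro") / np.sqrt(p)

rng = np.random.default_rng(3)
p = 5
M = rng.normal(size=(p, p))
Sigma = M @ M.T + p * np.eye(p)          # a positive definite "truth"
E = rng.normal(size=(p, p))
A1 = Sigma + 0.1 * (E + E.T)             # a perturbed estimate
A2 = Sigma

S = inv_sqrt(Sigma)
lhs = entropy_loss_norm(A1 - A2, Sigma)
middle = np.linalg.norm(S @ (A1 - A2) @ S, 2)
rhs = np.linalg.norm(A1 - A2, 2) / np.linalg.eigvalsh(Sigma).min()

# Trace form of the same quantity
D = A1 @ np.linalg.inv(Sigma) - np.eye(p)
trace_form = np.sqrt(np.trace(D @ D) / p)
```

Here `rhs` uses $\lambda_{\max}(\Sigma^{-1}) = 1/\lambda_{\min}(\Sigma)$, so the bound is computed without inverting $\Sigma$.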

Combining with the estimated low-rank matrix $\hat B \,\widehat{\mathrm{cov}}(f_t)\, \hat B'$, Theorem 3.1
implies the main theorem in this section:

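Theorem 3.2. Suppose $\log T = o(p)$. Under the assumptions of Theorem 3.1
and Assumption 3.5, we have:

(i) $P\left( \|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\Sigma}^2 \le \frac{C p K^2 (\log p)^2}{T^2} + \frac{C m_T^2 K^2 \log p}{T} \right) \ge 1 - O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$,

$P\left( \|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\max}^2 \le \frac{C K^2 \log p + C K^4 \log T}{T} \right) \ge 1 - O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$.

(ii) If $m_T K \sqrt{\frac{\log p}{T}} = o(1)$, with probability at least $1 - O(\frac{1}{p^2} + \frac{1}{T^2})$,

$\lambda_{\min}(\hat\Sigma^{\mathcal{T}}) \ge 0.5 \lambda_{\min}(\Sigma_u)$

and

$\|(\hat\Sigma^{\mathcal{T}})^{-1} - \Sigma^{-1}\| \le C m_T K \sqrt{\frac{\log p}{T}}$.

Note that we have derived a better convergence rate of $(\hat\Sigma^{\mathcal{T}})^{-1}$ than
that in Fan, Fan and Lv (2008). When the operator norm is considered, $p$ is
allowed to grow exponentially fast in $T$ in order for $(\hat\Sigma^{\mathcal{T}})^{-1}$ to be consistent.

We have also derived the maximum elementwise estimation error $\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\max}$.
This quantity appears in risk assessment as in Fan, Zhang and Yu (2008). For
any portfolio with allocation vector $w$, the true portfolio variance and the
estimated one are given by $w' \Sigma w$ and $w' \hat\Sigma^{\mathcal{T}} w$, respectively. The estimation
error is bounded by

$|w' \hat\Sigma^{\mathcal{T}} w - w' \Sigma w| \le \|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\max} \|w\|_1^2$,

where $\|w\|_1$, the $l_1$ norm of $w$, is the gross exposure of the portfolio.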
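
The risk bound above holds for any symmetric estimate, which the following sketch checks on a small long-short portfolio. The covariance matrix, the perturbation standing in for an estimation error, and the weights are all made-up illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 6
M = rng.normal(size=(p, p))
Sigma = M @ M.T / p + np.eye(p)               # "true" covariance (illustrative)
E = rng.normal(scale=0.05, size=(p, p))
Sigma_hat = Sigma + (E + E.T) / 2             # stand-in for an estimate

# Long-short allocation summing to one; gross exposure ||w||_1 = 1.6
w = np.array([0.5, 0.4, 0.4, -0.1, -0.1, -0.1])

true_var = w @ Sigma @ w
est_var = w @ Sigma_hat @ w
risk_error = abs(est_var - true_var)
gross_exposure = np.abs(w).sum()
bound = np.abs(Sigma_hat - Sigma).max() * gross_exposure**2

print(risk_error, bound)
```

The bound makes explicit why controlling the gross exposure limits how much elementwise covariance error can propagate into the estimated portfolio risk.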
4. Extension: Seemingly unrelated regression. A seemingly unrelated regression
model [Kmenta and Gilbert (1970)] is a set of linear equations in
which the disturbances are correlated across equations. Specifically, we have

(4.1) $y_{it} = b_i' f_{it} + u_{it}$, $i \le p$, $t \le T$,

where $b_i$ and $f_{it}$ are both $K_i \times 1$ vectors. The $p$ linear equations (4.1) are
related because their error terms $u_{it}$ are correlated; that is, the covariance
matrix

$\Sigma_u = (E u_{it} u_{jt})_{p \times p}$

is not diagonal.

Model (4.1) allows each variable $y_{it}$ to have its own factors. This is important
for many applications. In financial applications, the returns of individual
stocks depend on common market factors and sector-specific factors. In
housing price index modeling, housing price appreciations depend on both
national factors and the local economy. When $f_{it} = f_t$ for each $i \le p$, model (4.1)
reduces to the approximate factor model (1.1) with common factors $f_t$.

Under mild conditions, running OLS on each equation produces an unbiased
and consistent estimator of $b_i$ separately. However, since OLS does
not take into account the cross-sectional correlation among the noises, it is
not efficient. Instead, statisticians obtain the best linear unbiased estimator
(BLUE) via generalized least squares (GLS). Write

$y_i = (y_{i1}, \ldots, y_{iT})'$, $T \times 1$, $X_i = (f_{i1}, \ldots, f_{iT})'$, $T \times K_i$, $i \le p$,

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}$, $X = \begin{pmatrix} X_1 & & \\ & \ddots & \\ & & X_p \end{pmatrix}$, $B = \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix}$.

The GLS estimator of $B$ is given by Zellner (1962),

(4.2) $\hat B_{GLS} = [X' (\hat\Sigma_u \otimes I_T)^{-1} X]^{-1} [X' (\hat\Sigma_u \otimes I_T)^{-1} y]$,

where $I_T$ denotes a $T \times T$ identity matrix, $\otimes$ represents the Kronecker product
operation and $\hat\Sigma_u$ is a consistent estimator of $\Sigma_u$.

In classical seemingly unrelated regression in which $p$ does not grow
with $T$, $\Sigma_u$ is estimated by a two-stage procedure [Kmenta and Gilbert
(1970)]: In the first stage, estimate $B$ via OLS, and obtain residuals

(4.3) $\hat u_{it} = y_{it} - \hat b_i' f_{it}$.

In the second stage, estimate $\Sigma_u$ by

(4.4) $\hat\Sigma_u = (\hat\sigma_{ij}) = \left( \frac{1}{T} \sum_{t=1}^{T} \hat u_{it} \hat u_{jt} \right)_{p \times p}$.

In high-dimensional seemingly unrelated regression in which $p > T$, however,
$\hat\Sigma_u$ is not invertible, and hence the GLS estimator (4.2) is infeasible.
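By the sparsity assumption on $\Sigma_u$, we can deal with this singularity problem
by using the adaptive thresholding estimator, and produce a consistent
nonsingular estimator of $\Sigma_u$:

(4.5) $\hat\Sigma_u^{\mathcal{T}} = (\hat\sigma_{ij}^{\mathcal{T}})$, $\hat\sigma_{ij}^{\mathcal{T}} = \hat\sigma_{ij} I(|\hat\sigma_{ij}| \ge \sqrt{\hat\theta_{ij}}\, \omega_T)$,
$\hat\theta_{ij} = \frac{1}{T} \sum_{t=1}^{T} (\hat u_{it} \hat u_{jt} - \hat\sigma_{ij})^2$.

To pursue this goal, we impose the following assumptions:

Assumption 4.1. For each $i \le p$:

(i) $\{f_{it}\}_{t \ge 1}$ is stationary and ergodic.

(ii) $\{u_t\}_{t \ge 1}$ and $\{f_{it}\}_{t \ge 1}$ are independent.

Assumption 4.2. There exist positive constants $C$ and $r_2$ such that
for each $i \le p$, the strong mixing condition

$\alpha(t) \le \exp(-C t^{r_2})$

is satisfied by $(f_{it}, u_t)$.

Assumption 4.3. There exist constants $M$ and $C > 0$ such that for all
$i \le p$, $j \le K_i$, $t \le T$:

(i) $E y_{it}^2 < M$, $|b_{ij}| < M$ and $E f_{it,j}^2 < M$.

(ii) $\min_{i \le p} \lambda_{\min}(\mathrm{cov}(f_{it})) > C$.

Assumption 4.4. There exists a constant $r_4 > 0$ with $3 r_4^{-1} + r_2^{-1} > 1$,
and $b_3 > 0$ such that for any $s > 0$ and $i, j$,

$P(|f_{it,j}| > s) \le \exp(-(s/b_3)^{r_4})$.

These assumptions are similar to those made in Section 3, except that
here they are imposed on the sector-specific factors. The main theorem in
this section is a direct application of Theorem 2.1, which shows that the
adaptive thresholding produces a consistent nonsingular estimator of $\Sigma_u$.

Theorem 4.1. Let $K = \max_{i \le p} K_i$ and $\gamma_3^{-1} = 1.5 r_1^{-1} + 1.5 r_4^{-1} + r_2^{-1}$;
suppose $K = o(p)$, $K^4 (\log p)^2 = o(T)$ and $(\log p)^{2/\gamma_3 - 1} = o(T)$. Under
Assumptions 2.1, 4.1-4.4, there exist constants $C_1 > 0$ and $C_2 > 0$ such that
the adaptive thresholding estimator defined in (4.5) with $\omega_T^2 = C_1 \frac{K^2 \log p}{T}$ satisfies:

(i) $P\left( \|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\| \le C_2 m_T K \sqrt{\frac{\log p}{T}} \right) \ge 1 - O\left( \frac{1}{p^2} + \frac{1}{T^2} \right)$.

(ii) If $m_T K \sqrt{\frac{\log p}{T}} = o(1)$, then with probability at least $1 - O(\frac{1}{p^2} + \frac{1}{T^2})$,

$\lambda_{\min}(\hat\Sigma_u^{\mathcal{T}}) \ge 0.5 \lambda_{\min}(\Sigma_u)$

and

$\|(\hat\Sigma_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le C_2 m_T K \sqrt{\frac{\log p}{T}}$.

Therefore, in the case when $p > T$, Theorem 4.1 enables us to efficiently
estimate $B$ via feasible GLS:

$\hat B_{GLS}^{\mathcal{T}} = [X' (\hat\Sigma_u^{\mathcal{T}} \otimes I_T)^{-1} X]^{-1} [X' (\hat\Sigma_u^{\mathcal{T}} \otimes I_T)^{-1} y]$.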
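
A small-scale sketch of the GLS computation for a seemingly unrelated regression system follows. It stacks the equations, builds the block-diagonal design, and weights by the inverse of the Kronecker product of an error covariance matrix with the identity. For illustration the true error covariance is passed in; in the feasible version one would pass a nonsingular estimate such as a thresholded residual covariance. The helper names and the simulated system are assumptions, and forming the full Kronecker product is only practical for small `p` and `T`.

```python
import numpy as np

def block_diag(blocks):
    # Place the T x K_i design matrices along the diagonal of one big matrix
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

def gls(y_list, X_list, Sigma_u):
    """Stacked GLS with error covariance Sigma_u (x) I_T."""
    T = X_list[0].shape[0]
    y = np.concatenate(y_list)
    X = block_diag(X_list)
    # (Sigma_u (x) I_T)^{-1} = Sigma_u^{-1} (x) I_T
    W = np.kron(np.linalg.inv(Sigma_u), np.eye(T))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(5)
p, T = 3, 50
K_list = [2, 1, 3]                              # equation-specific K_i
X_list = [rng.normal(size=(T, k)) for k in K_list]
b_true = [rng.normal(size=k) for k in K_list]
Sigma_u = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 2.0]])
U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T)   # T x p errors
y_list = [X_list[i] @ b_true[i] + U[:, i] for i in range(p)]

b_gls = gls(y_list, X_list, Sigma_u)
b_ols_stacked = gls(y_list, X_list, np.eye(p))  # identity weights give OLS
```

With a diagonal weighting matrix the stacked estimator collapses to equation-by-equation OLS, which is a convenient sanity check on the stacking order.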



5. Monte Carlo experiments. In this section, we use simulation to demonstrate
the rates of convergence of the estimators $\hat\Sigma^{\mathcal{T}}$ and $(\hat\Sigma^{\mathcal{T}})^{-1}$ that we
have obtained so far. The simulation model is a modified version of the
Fama-French three-factor model described in Fan, Fan and Lv (2008). We
fix the number of factors, $K = 3$, and the length of time, $T = 500$, and let
the dimensionality $p$ gradually increase.

The Fama-French three-factor model [Fama and French (1992)] is given by

$y_{it} = b_{i1} f_{1t} + b_{i2} f_{2t} + b_{i3} f_{3t} + u_{it}$,

which models the excess return (real rate of return minus risk-free rate) of
the $i$th stock of a portfolio, $y_{it}$, with respect to 3 factors. The first factor is the
excess return of the whole stock market, and the weighted excess return on
all NASDAQ, AMEX and NYSE stocks is a commonly used proxy. It extends
the capital asset pricing model (CAPM) by adding two new factors: SMB
("small minus big" cap) and HML ("high minus low" book/price). These two
were added to the model after the observation that two types of stocks
(small caps, and stocks with a high book value to price ratio) tend to outperform the stock
market as a whole.

We separate this section into three parts: calibration, simulation and results.
Similarly to Section 5 of Fan, Fan and Lv (2008), in the calibration
part we want to calculate realistic multivariate distributions from which we
can generate the factor loadings $B$, idiosyncratic noises $\{u_t\}_{t=1}^{T}$ and the
observable factors $\{f_t\}_{t=1}^{T}$. The data was obtained from the data library of
Kenneth French's website.

5.1. Calibration. To estimate the parameters in the Fama-French model,
we will use the two-year daily data $(y_t, f_t)$ from Jan 1st, 2009 to Dec 31st,
2010 ($T = 500$) of 30 industry portfolios.

(1) Calculate the least squares estimator $\hat B$ of $y_t = B f_t + u_t$, and take
the rows of $\hat B$, namely $\hat b_1 = (b_{11}, b_{12}, b_{13}), \ldots, \hat b_{30} = (b_{30,1}, b_{30,2}, b_{30,3})$, to
calculate the sample mean vector $\mu_B$ and sample covariance matrix $\Sigma_B$. The
results are depicted in Table 1. We then create a multivariate normal distribution
$N_3(\mu_B, \Sigma_B)$, from which the factor loadings $\{b_i\}_{i=1}^{p}$ are drawn.

Table 1
Mean and covariance matrix used to generate b

   $\mu_B$                $\Sigma_B$
  1.0641     0.0475    0.0218    0.0488
  0.1233     0.0218    0.0945    0.0215
 -0.0119     0.0488    0.0215    0.1261

(2) For each fixed $p$, create the sparse matrix $\Sigma_u = D + ss' - \mathrm{diag}\{s_1^2, \ldots,
s_p^2\}$ in the following way. Let $\hat u_t = y_t - \hat B f_t$. For $i = 1, \ldots, 30$, let $\hat\sigma_i$ denote the
standard deviation of the residuals of the $i$th portfolio. We find $\min(\hat\sigma_i) =
0.3533$, $\max(\hat\sigma_i) = 1.5222$ and calculate the mean and the standard deviation
of the $\hat\sigma_i$'s, namely $\bar\sigma = 0.6055$ and $\sigma_{SD} = 0.2621$.

Let $D = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_p^2\}$, where $\sigma_1, \ldots, \sigma_p$ are generated independently
from the Gamma distribution $G(\alpha, \beta)$, with mean $\alpha\beta$ and standard deviation
$\alpha^{1/2}\beta$. We match these values to $\bar\sigma = 0.6055$ and $\sigma_{SD} = 0.2621$, to get
$\alpha = 5.6840$ and $\beta = 0.1503$. Further, we create a loop that only accepts the
value of $\sigma_i$ if it is between $\min(\hat\sigma_i) = 0.3533$ and $\max(\hat\sigma_i) = 1.5222$.

Create s = to be a sparse vector. We set each Sj ~ iV(0, 1) 

with probability ^{^^p , and Sj = otherwise. This leads to an average of 

nonzero elements per each row of the error covariance matrix. 
Create a loop that generates S„ multiple times until it is positive definite. 
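The construction of $\Sigma_u$ in step (2) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and `prob_nonzero` is a free parameter chosen here for illustration, because the probability used in the text is not legible in this copy.

```python
import numpy as np

def draw_sigma(rng, p, shape=5.6840, scale=0.1503, lo=0.3533, hi=1.5222):
    """Draw p standard deviations from Gamma(shape, scale), accepted only in [lo, hi]."""
    out = np.empty(p)
    for i in range(p):
        x = rng.gamma(shape, scale)
        while not (lo <= x <= hi):       # acceptance loop described in the text
            x = rng.gamma(shape, scale)
        out[i] = x
    return out

def make_sigma_u(rng, p, prob_nonzero, max_tries=1000):
    """Return Sigma_u = D + s s' - diag(s_1^2, ..., s_p^2), redrawn until positive definite."""
    for _ in range(max_tries):
        sigma = draw_sigma(rng, p)
        D = np.diag(sigma ** 2)
        # s_i ~ N(0, 1) with probability prob_nonzero (assumed value), else 0.
        s = rng.normal(size=p) * (rng.random(p) < prob_nonzero)
        Sigma_u = D + np.outer(s, s) - np.diag(s ** 2)
        if np.linalg.eigvalsh(Sigma_u).min() > 0:
            return Sigma_u
    raise RuntimeError("no positive definite draw found")

rng = np.random.default_rng(1)
Sigma_u = make_sigma_u(rng, p=50, prob_nonzero=0.05)
```

Subtracting $\mathrm{diag}\{s_i^2\}$ keeps the diagonal equal to $D$, so the sparse vector $s$ only contributes off-diagonal correlation.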

(3) Assume the factors follow the vector autoregressive [VAR(1)] model
$f_t = \mu + \Phi f_{t-1} + \varepsilon_t$ for some $3 \times 3$ matrix
$\Phi$, where $\varepsilon_t$'s are i.i.d. $N_3(0, \Sigma_\varepsilon)$.
We estimate $\mu$, $\Phi$ and $\Sigma_\varepsilon$ from the data, and obtain
$\mathrm{cov}(f_t)$. They are summarized in Table 2.
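The VAR(1) factor process of step (3) can be simulated along the following lines. This is a sketch under stated assumptions: the values are taken from Table 2, the row and column layout of `Phi` is assumed because the table is flattened in this copy, and `Sigma_eps` is not reported, so it is backed out here from the stationarity identity $\mathrm{cov}(f_t) = \Phi\,\mathrm{cov}(f_t)\,\Phi' + \Sigma_\varepsilon$.

```python
import numpy as np

# Table 2 values; the layout of Phi is an assumption (see lead-in).
mu = np.array([0.1074, 0.0357, 0.0033])
cov_f = np.array([[2.2540, 0.2735, 0.9197],
                  [0.2735, 0.3767, 0.0430],
                  [0.9197, 0.0430, 0.6822]])
Phi = np.array([[-0.1149,  0.0024, 0.0776],
                [ 0.0016, -0.0162, 0.0387],
                [-0.0399,  0.0218, 0.0351]])

# Innovation covariance implied by stationarity (assumption, not reported).
Sigma_eps = cov_f - Phi @ cov_f @ Phi.T

def simulate_var1(rng, T, burn_in=100):
    """Simulate f_t = mu + Phi f_{t-1} + eps_t with eps_t ~ N(0, Sigma_eps)."""
    chol = np.linalg.cholesky(Sigma_eps)
    f = np.linalg.solve(np.eye(3) - Phi, mu)   # start at the stationary mean
    out = np.empty((T, 3))
    for t in range(-burn_in, T):
        f = mu + Phi @ f + chol @ rng.normal(size=3)
        if t >= 0:                             # discard the burn-in draws
            out[t] = f
    return out

F = simulate_var1(np.random.default_rng(2), T=500)
```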

5.2. Simulation. For each fixed $p$, we generate $(b_1, \ldots, b_p)$
independently from $N_3(\mu_B, \Sigma_B)$, and generate $\{f_t\}_{t=1}^T$ and
$\{u_t\}_{t=1}^T$ independently. We keep $T = 500$ fixed, and gradually
increase $p$ from 20 to 600 in multiples of 20 to illustrate the rates of
convergence when the number of variables diverges with respect to the sample
size.

Table 2
Parameters of $f_t$ generating process

    $\mu$       $\mathrm{cov}(f_t)$                $\Phi$
    0.1074      2.2540   0.2735   0.9197      -0.1149    0.0024   0.0776
    0.0357      0.2735   0.3767   0.0430       0.0016   -0.0162   0.0387
    0.0033      0.9197   0.0430   0.6822      -0.0399    0.0218   0.0351

Fig. 1. Averages and standard deviations of
$\|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_\Sigma$ (dashed curve) and
$\|\hat{\Sigma}_y - \Sigma\|_\Sigma$ (solid curve) over $N = 200$ iterations,
as a function of the dimensionality $p$. [Panels: Averages; Standard
Deviations. Horizontal axis: $p$.]

Repeat the following steps $N = 200$ times for each fixed $p$:

(1) Generate $\{b_i\}_{i=1}^p$ independently from $N_3(\mu_B, \Sigma_B)$, and
set $B = (b_1, \ldots, b_p)'$.

(2) Generate $\{u_t\}_{t=1}^T$ independently from $N_p(0, \Sigma_u)$.

(3) Generate $\{f_t\}_{t=1}^T$ independently from the VAR(1) model
$f_t = \mu + \Phi f_{t-1} + \varepsilon_t$.

(4) Calculate $y_t = Bf_t + u_t$ for $t = 1, \ldots, T$.

(5) Set $\omega_T = 0.10K\sqrt{\log p/T}$ to obtain the thresholding estimator
(2.5) $\hat{\Sigma}_u^{\mathcal{T}}$ and the sample covariance matrices
$\widehat{\mathrm{cov}}(f_t)$,
$\hat{\Sigma}_y = \frac{1}{T-1}\sum_{t=1}^T (y_t - \bar{y})(y_t - \bar{y})'$.
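Step (5) can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the adaptive rule $\hat{\sigma}_{ij} I(|\hat{\sigma}_{ij}| \ge \omega_T \hat{\theta}_{ij}^{1/2})$ with $\hat{\theta}_{ij} = T^{-1}\sum_t (\hat{u}_{it}\hat{u}_{jt} - \hat{\sigma}_{ij})^2$, and it assumes the diagonal is never thresholded. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def threshold_estimator(Y, F, const=0.10):
    """Y: T x p observations (rows y_t'), F: T x K factors (rows f_t')."""
    T, p = Y.shape
    K = F.shape[1]
    B_hat = np.linalg.lstsq(F, Y, rcond=None)[0].T   # p x K least squares loadings
    U = Y - F @ B_hat.T                              # residuals, rows u_hat_t'
    S = U.T @ U / T                                  # sigma_hat_ij
    # theta_hat_ij = (1/T) sum_t (u_hat_it u_hat_jt - sigma_hat_ij)^2
    prod = U[:, :, None] * U[:, None, :]             # T x p x p array of products
    theta = ((prod - S) ** 2).mean(axis=0)
    omega = const * K * np.sqrt(np.log(p) / T)       # omega_T = 0.10 K sqrt(log p / T)
    keep = np.abs(S) >= omega * np.sqrt(theta)       # entry-adaptive hard threshold
    np.fill_diagonal(keep, True)                     # assumption: keep all variances
    Sigma_u_hat = np.where(keep, S, 0.0)
    Sigma_hat = B_hat @ np.cov(F, rowvar=False) @ B_hat.T + Sigma_u_hat
    return Sigma_hat, Sigma_u_hat

rng = np.random.default_rng(3)
T, p, K = 500, 20, 3
F = rng.normal(size=(T, K))
B = rng.normal(size=(p, K))
Y = F @ B.T + rng.normal(size=(T, p))
Sigma_hat, Sigma_u_hat = threshold_estimator(Y, F)
```

The $T \times p \times p$ array is convenient for small $p$; for $p$ in the hundreds a loop over rows would use far less memory.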

We graph the convergence of $\hat{\Sigma}^{\mathcal{T}}$ and $\hat{\Sigma}_y$
to $\Sigma$, the covariance matrix of $y$, under the entropy-loss norm
$\|\cdot\|_\Sigma$ and the elementwise norm $\|\cdot\|_{\max}$. We also graph
the convergence of the inverses $(\hat{\Sigma}^{\mathcal{T}})^{-1}$ and
$\hat{\Sigma}_y^{-1}$ to $\Sigma^{-1}$ under the operator norm. Note that we
graph that only for $p$ from 20 to 300. Since $T = 500$, for $p > 500$ the
sample covariance matrix is singular. Also, for $p$ close to 500,
$\hat{\Sigma}_y$ is nearly singular, which leads to abnormally large values of
the operator norm. Last, we record the standard deviations of these norms.
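The three error measures used above can be computed as in this sketch. It assumes the entropy-loss norm is $\|A\|_\Sigma = p^{-1/2}\|\Sigma^{-1/2}A\Sigma^{-1/2}\|_F$, as in Fan, Fan and Lv (2008); the function names are hypothetical.

```python
import numpy as np

def entropy_loss_norm(A, Sigma):
    """||A||_Sigma = p^{-1/2} ||Sigma^{-1/2} A Sigma^{-1/2}||_F for symmetric A."""
    p = Sigma.shape[0]
    M = np.linalg.solve(Sigma, A)          # Sigma^{-1} A
    # For symmetric A: ||Sigma^{-1/2} A Sigma^{-1/2}||_F^2 = tr(Sigma^{-1} A Sigma^{-1} A)
    return np.sqrt(np.trace(M @ M) / p)

def max_norm(A):
    """Elementwise norm: the largest absolute entry."""
    return np.abs(A).max()

def operator_norm(A):
    """Operator norm: the largest singular value."""
    return np.linalg.norm(A, 2)
```

A quick sanity check is that `entropy_loss_norm(Sigma, Sigma)` equals 1 for any positive definite `Sigma`, which is the normalization the factor $p^{-1/2}$ provides.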

Fig. 2. Averages and standard deviations of
$\|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_{\max}$ (dashed curve) and
$\|\hat{\Sigma}_y - \Sigma\|_{\max}$ (solid curve) over $N = 200$ iterations,
as a function of the dimensionality $p$. [Panels: Averages; Standard
Deviations. Horizontal axis: $p$.]

5.3. Results. In Figures 1-3, the dashed curves correspond to
$\hat{\Sigma}^{\mathcal{T}}$ and the solid curves correspond to the sample
covariance matrix $\hat{\Sigma}_y$. Figures 1 and 2 present the averages and
standard deviations of the estimation error of both of these matrices with
respect to the $\Sigma$-norm and infinity norm, respectively. Figure 3
presents the averages and standard deviations of the estimation errors of the
inverses with respect to the operator norm. Based on the simulation results,
we can make the following observations:

(1) The standard deviations of the norms are negligible when compared
to their corresponding averages.

(2) Under the $\|\cdot\|_\Sigma$ norm, our estimate of the covariance matrix
of $y$, $\hat{\Sigma}^{\mathcal{T}}$, performs much better than the sample
covariance matrix $\hat{\Sigma}_y$. Note that, in the proof of Theorem 2 in
Fan, Fan and Lv (2008), it was shown that

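    (5.1)   $\|\hat{\Sigma}_y - \Sigma\|_\Sigma^2 = O_p\left(\frac{K^3}{pT}\right) + O_p\left(\frac{p}{T}\right) + O_p\left(\frac{K^{3/2}}{T}\right).$

For a small fixed value of $K$, such as $K = 3$, the dominating term in (5.1)
is $O_p(p/T)$. From Theorem 4.1, and given that $m_T = o(p^{1/4})$, the dominating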

term in the convergence of ||S''" — S||| is Op(^ + "^^j?^^ ). So, we would 
expect our estimator to perform better, and the simulation results are con- 
sistent with the theory. 

(3) Under the infinity norm, both estimators perform roughly the same. 
This is to be expected, given that the thresholding affects mainly the elements 
of the covariance matrix that are closest to 0, and the infinity norm depicts 
the magnitude of the largest elementwise absolute error. 

(4) Under the operator norm, the inverse of our estimator,
$(\hat{\Sigma}^{\mathcal{T}})^{-1}$, also performs significantly better than
the inverse of the sample covariance matrix.

(5) Finally, when $p > 500$, the thresholding estimators
$\hat{\Sigma}_u^{\mathcal{T}}$ and $\hat{\Sigma}^{\mathcal{T}}$ are still
nonsingular.

In conclusion, even after imposing less restrictive assumptions on the error
covariance matrix, we still reach an estimator that significantly outperforms
the standard sample covariance matrix.

Fig. 3. Averages and standard deviations of
$\|(\hat{\Sigma}^{\mathcal{T}})^{-1} - \Sigma^{-1}\|$ (dashed curve) and
$\|\hat{\Sigma}_y^{-1} - \Sigma^{-1}\|$ (solid curve) over $N = 200$
iterations, as a function of the dimensionality $p$.

6. Conclusions and discussions. We studied the rate of convergence of
high-dimensional covariance matrix estimators in approximate factor models
under various norms. By assuming a sparse error covariance matrix, we allow
for the presence of cross-sectional correlation even after taking out common
factors. Since direct observations of the noises are not available, we first
constructed the error sample covariance matrix based on the estimation
residuals, and then estimated the error covariance matrix using the adaptive
thresholding method. We then constructed the covariance matrix of $y_t$ using
the factor model, assuming that the factors follow a stationary and ergodic
process, but can be weakly dependent. It was shown that after thresholding,
the estimated covariance matrices are still invertible even if $p > T$, and
the rate of convergence of $(\hat{\Sigma}^{\mathcal{T}})^{-1}$ and
$(\hat{\Sigma}_u^{\mathcal{T}})^{-1}$ is of order
$O_p(Km_T\sqrt{\log p/T})$, where $K$ comes from the impact of estimating the
unobservable noise terms. This demonstrates that when estimating the inverse
covariance matrix, $p$ is allowed to be much larger than $T$.

In fact, the rate of convergence in Theorem 2.1 reflects the impact of un- 
observable idiosyncratic components on the thresholding method. Generally, 
whether it is the minimax rate when direct observations are not available 
but have to be estimated is an important question, which is left as a research 
direction in the future. 

Moreover, this paper uses the hard-thresholding technique, which takes
the form of $\hat{\sigma}_{ij}(\sigma_{ij}) = \sigma_{ij}I(|\sigma_{ij}| > \theta_{ij})$
for some pre-determined threshold $\theta_{ij}$. Recently, Rothman, Levina and
Zhu (2009) and Cai and Liu (2011) studied a more general thresholding function
of Antoniadis and Fan (2001), which admits the form
$\hat{\sigma}_{ij}(\theta_{ij}) = s(\sigma_{ij})$, and also allows for
soft-thresholding. It is easy to apply the more general thresholding here as
well, and the rate of convergence of the resulting covariance matrix
estimators should be straightforward to derive.
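To make the contrast between the two rules concrete, here is a minimal sketch of hard and soft thresholding applied entrywise. The function names are hypothetical; the soft rule shown is the standard shrinkage $\mathrm{sign}(\sigma)(|\sigma| - \theta)_+$, given as one example of a generalized thresholding function.

```python
import numpy as np

def hard_threshold(sigma, theta):
    """Keep sigma unchanged if |sigma| > theta, otherwise set it to zero."""
    return np.where(np.abs(sigma) > theta, sigma, 0.0)

def soft_threshold(sigma, theta):
    """Shrink sigma toward zero by theta: sign(sigma) * max(|sigma| - theta, 0)."""
    return np.sign(sigma) * np.maximum(np.abs(sigma) - theta, 0.0)

values = np.array([-0.5, 0.1, 0.3])
hard = hard_threshold(values, 0.2)   # entries above the threshold are untouched
soft = soft_threshold(values, 0.2)   # surviving entries are also shrunk by 0.2
```

Both rules zero out the same entries; they differ only in whether the surviving entries are shrunk.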

Finally, we considered the case when common factors are observable, as 
in Fama and French (1992). In some applications, the common factors are 
unobservable and need to be estimated [Bai (2003)]. In that case, it is still 
possible to consistently estimate the covariance matrices using similar tech- 
niques as those in this paper. However, the impact of high dimensionality 
on the rate of convergence comes also from the estimation error of the un- 
observable factors. We plan to address this problem in a separate paper. 

APPENDIX A: PROOFS FOR SECTION 2 

A.1. Lemmas. We first prove the following lemmas, in which we consider the
operator norm $\|A\|^2 = \lambda_{\max}(A'A)$.
Lemma A.1. Let $A$ be an $m \times m$ random matrix, $B$ be an $m \times m$
deterministic matrix, and both $A$ and $B$ are semi-positive definite. If
there exists a positive sequence $\{c_T\}_{T=1}^\infty$ such that for all
large enough $T$, $\lambda_{\min}(B) > c_T$, then

    $P(\lambda_{\min}(A) \ge 0.5c_T) \ge P(\|A - B\| \le 0.5c_T)$

and

    $P\left(\|A^{-1} - B^{-1}\| \le \frac{2}{c_T^2}\|A - B\|\right) \ge P(\|A - B\| \le 0.5c_T).$

Proof. For any $v \in R^m$ such that $\|v\| = 1$, under the event
$\|A - B\| \le 0.5c_T$,

    $v'Av = v'Bv - v'(B - A)v \ge \lambda_{\min}(B) - \|A - B\| \ge 0.5c_T.$

Hence $\lambda_{\min}(A) \ge 0.5c_T$.

In addition, still under the event $\|A - B\| \le 0.5c_T$,

    $\|A^{-1} - B^{-1}\| = \|A^{-1}(B - A)B^{-1}\|$
    $\le \lambda_{\min}(A)^{-1}\|A - B\|\lambda_{\min}(B)^{-1}$
    $\le 2c_T^{-2}\|A - B\|.$ □
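Lemma A.1 can be checked numerically on a single example. The sketch below is only an illustration of the two inequalities, not part of the proof; the matrices are synthetic and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 6
G = rng.normal(size=(m, m))
B = G @ G.T + np.eye(m)                 # plays the role of the deterministic matrix
c = np.linalg.eigvalsh(B).min()         # a valid choice of c_T up to equality

# Symmetric perturbation scaled so that ||A - B|| = 0.4 c, inside the event.
E = rng.normal(size=(m, m))
E = (E + E.T) / 2
E *= 0.4 * c / np.linalg.norm(E, 2)
A = B + E

lam_min_A = np.linalg.eigvalsh(A).min()
lhs = np.linalg.norm(np.linalg.inv(A) - np.linalg.inv(B), 2)
rhs = 2 / c**2 * np.linalg.norm(A - B, 2)
```

On this example `lam_min_A` is at least `0.5 * c` and `lhs` does not exceed `rhs`, as the lemma predicts.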

Lemma A.2. Suppose that the random variables $Z_1, Z_2$ both satisfy the
exponential-type tail condition: There exist $r_1, r_2 \in (0, 1)$ and
$b_1, b_2 > 0$, such that $\forall s > 0$,

    $P(|Z_i| > s) \le \exp(1 - (s/b_i)^{r_i}), \quad i = 1, 2.$

Then for some $r_3$ and $b_3 > 0$, and any $s > 0$,

    (A.1)   $P(|Z_1Z_2| > s) \le \exp(1 - (s/b_3)^{r_3}).$

Proof. We have, for any $s > 0$, $M = (sb_2^{r_2/r_1}/b_1)^{r_1/(r_1+r_2)}$,
$b = b_1b_2$ and $r = r_1r_2/(r_1 + r_2)$,

    $P(|Z_1Z_2| > s) \le P(M|Z_1| > s) + P(|Z_2| > M)$
    $\le \exp(1 - (s/(b_1M))^{r_1}) + \exp(1 - (M/b_2)^{r_2})$
    $= 2\exp(1 - (s/b)^r).$

Pick up an $r_3 \in (0, r)$, and
$b_3 > \max\{(r_3/r)^{1/r}b, (1 + \log 2)^{1/r}b\}$; then it can be shown that
$F(s) = (s/b)^r - (s/b_3)^{r_3}$ is increasing when $s > b_3$. Hence
$F(s) > F(b_3) > \log 2$ when $s > b_3$, which implies when $s > b_3$,

    $P(|Z_1Z_2| > s) \le 2\exp(1 - (s/b)^r) \le \exp(1 - (s/b_3)^{r_3}).$

When $s \le b_3$,

    $P(|Z_1Z_2| > s) \le 1 \le \exp(1 - (s/b_3)^{r_3}).$ □
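Lemma A.3. Under the assumptions of Theorem 2.1, there exists a constant
$C_r > 0$ that does not depend on $(p, T)$, such that when $C > C_r$,

(i)
    $P\left(\max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}\right| > C\sqrt{\frac{\log p}{T}}\right) = O\left(\frac{1}{p^2}\right),$

(ii)
    $P\left(\max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T (u_{it}u_{jt} - \hat{u}_{it}\hat{u}_{jt})\right| > Ca_T\right) = O\left(\frac{1}{p^2} + \kappa_1(p, T)\right),$

(iii)
    $P\left(\max_{i,j \le p}|\hat{\sigma}_{ij} - \sigma_{ij}| > C\left(\sqrt{\frac{\log p}{T}} + a_T\right)\right) = O\left(\frac{1}{p^2} + \kappa_1(p, T)\right).$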



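Proof. (i) By Assumption 2.1 and Lemma A.2, $u_{it}u_{jt}$ satisfies the
exponential tail condition, with parameter $r_1/3$ as shown in the proof of
Lemma A.2. Therefore by the Bernstein's inequality [Theorem 1 of Merlevède,
Peligrad and Rio (2009)], there exist constants $C_1, C_2, C_3, C_4$ and
$C_5 > 0$ that only depend on $b_1$, $r_1$ and $r_2$ such that for any
$i, j \le p$, and $\gamma^{-1} = 3r_1^{-1} + r_2^{-1}$,

    $P\left(\left|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}\right| \ge s\right) \le T\exp\left(-\frac{(Ts)^\gamma}{C_1}\right) + \exp\left(-\frac{T^2s^2}{C_2(1 + TC_3)}\right)$
    $\qquad + \exp\left(-\frac{(Ts)^2}{C_4T}\exp\left(\frac{(Ts)^{\gamma(1-\gamma)}}{C_5(\log Ts)^\gamma}\right)\right).$

Using Bonferroni's method, we have

    $P\left(\max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}\right| > s\right) \le p^2\max_{i,j \le p}P\left(\left|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}\right| > s\right).$

Let $s = C\sqrt{(\log p)/T}$ for some $C > 0$. It is not hard to check that
when $(\log p)^{[\ldots]} = o(T)$ (by assumption), for large enough $C$,

    $p^2T\exp\left(-\frac{(Ts)^\gamma}{C_1}\right) + p^2\exp\left(-\frac{(Ts)^2}{C_4T}\exp\left(\frac{(Ts)^{\gamma(1-\gamma)}}{C_5(\log Ts)^\gamma}\right)\right) = o\left(\frac{1}{p^2}\right)$

and

    $p^2\exp\left(-\frac{T^2s^2}{C_2(1 + TC_3)}\right) = O\left(\frac{1}{p^2}\right).$

This proves (i).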
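(ii) For some $C_1' > 0$ such that

    (A.2)   $P\left(\max_{i \le p}\frac{1}{T}\sum_{t=1}^T (u_{it} - \hat{u}_{it})^2 > C_1'a_T^2\right) = O(\kappa_1(p, T)),$

under the event
$\{\max_{i \le p}|\frac{1}{T}\sum_{t=1}^T u_{it}^2 - \sigma_{ii}| \le \max_{i \le p}\sigma_{ii}/4\} \cap \{\max_{i \le p}\frac{1}{T}\sum_{t=1}^T (u_{it} - \hat{u}_{it})^2 \le C_1'a_T^2\}$,
by the Cauchy-Schwarz inequality,

    $Z = \max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T (u_{it}u_{jt} - \hat{u}_{it}\hat{u}_{jt})\right|$
    $\le \max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T (\hat{u}_{it} - u_{it})(\hat{u}_{jt} - u_{jt})\right| + 2\max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T u_{it}(\hat{u}_{jt} - u_{jt})\right|$
    $\le \max_{i \le p}\frac{1}{T}\sum_{t=1}^T (\hat{u}_{it} - u_{it})^2 + 2\sqrt{\max_{i \le p}\frac{1}{T}\sum_{t=1}^T u_{it}^2}\sqrt{\max_{i \le p}\frac{1}{T}\sum_{t=1}^T (\hat{u}_{it} - u_{it})^2}$
    $\le C_1'a_T^2 + 2\sqrt{\frac{5}{4}\max_{i \le p}\sigma_{ii}}\sqrt{C_1'a_T^2}.$

Since $a_T = o(1)$, when $C > 3\sqrt{C_1'\max_{i \le p}\sigma_{ii}}$, we have,
for all large $T$,

    $Ca_T > C_1'a_T^2 + 2\sqrt{\frac{5}{4}\max_{i \le p}\sigma_{ii}}\sqrt{C_1'a_T^2}$

and

    $P(Z \le Ca_T) \ge 1 - P\left(\max_{i \le p}\left|\frac{1}{T}\sum_{t=1}^T u_{it}^2 - \sigma_{ii}\right| > \max_{i \le p}\sigma_{ii}/4\right) - P\left(\max_{i \le p}\frac{1}{T}\sum_{t=1}^T (u_{it} - \hat{u}_{it})^2 > C_1'a_T^2\right).$

By part (i) and (A.2), $P(Z \le Ca_T) \ge 1 - O(p^{-2} + \kappa_1(p, T))$.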

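(iii) By (i) and (ii), there exists $C_r > 0$ such that when $C > C_r$, the
displayed inequalities in (i) and (ii) hold. Under the event
$\{\max_{i,j \le p}|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}| \le C\sqrt{(\log p)/T}\} \cap \{\max_{i,j \le p}|\frac{1}{T}\sum_{t=1}^T (\hat{u}_{it}\hat{u}_{jt} - u_{it}u_{jt})| \le Ca_T\}$,
by the triangular inequality,

    $\max_{i,j \le p}|\hat{\sigma}_{ij} - \sigma_{ij}| \le \max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{ij}\right| + \max_{i,j \le p}\left|\frac{1}{T}\sum_{t=1}^T (\hat{u}_{it}\hat{u}_{jt} - u_{it}u_{jt})\right|$
    $\le C\left(\sqrt{\frac{\log p}{T}} + a_T\right).$

Hence the desired result follows from part (i) and part (ii) of the lemma. □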






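Lemma A.4. Under Assumptions 2.1, 2.2,

    $P\left(C_L \le \min_{i,j \le p}\hat{\theta}_{ij} \le \max_{i,j \le p}\hat{\theta}_{ij} \le C_U\right) \ge 1 - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right),$

where

    $C_L = \frac{1}{45}\min_{i,j}\mathrm{var}(u_{it}u_{jt}),$
    $C_U = 3\max_{i \le p}\sigma_{ii} + 4\max_{i,j}\mathrm{var}(u_{it}u_{jt}).$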

Proof, (i) Using Bernstein's inequality and the same argument as in the 
proof of Lemma A.3(i), there exists > 0, when C > C'^ and (logp)^/"^"^ = 
o{T), 



P\ max 



1 ^ 

— ^^{uitUjt - CTijf - vaic{uitUjt) 



t=i 



>C 



logp 



T 



O 



For some C > 0, under the event 0^=1 where 



= < max|(Tjj — CTjjl < C 

i,j<P 



logp 



+ ar 



i<P,t<T 



A2 = < max \uit — uu] < min< —, \ [ 20max(Tii I minvar(uitnj( 



^4 



/ if ^ 

< max — > ti,-. — CTj. 

< max T^y^iuitUjt - aijf - \&i{uitUjt 



< C 



logp 



we have, for any by adding and subtracting terms, 



2 ^ 

< — y^iuitUjt - (Jijf + 2max(fj.y - 



U,7 
< 4 max I u it - n jt p ^max an + max ^ ^ u| ^



+ 4var(uijiijt) + O ( y + + a| 



< |^2Cy i^+CflT + 2 max cjiij 

+ 4var(iiij'Ujf) +o(l), 

where the O(-) and o(-) terms are uniform in p an T. Hence under HiLi ^i: 
for all large enough T,p, uniformly in we have 

^ Smaxfjjj + 4maxvar(njtu,f). 

Still by adding and subtracting terms, we obtain 
1 ■ 
f 



it -CTij) 



t 

- J' '^{uitUjt - uuUjtf + ^ '^{uitUjt - dijf + 4(cr.y - dijf 
t t 

-J'Yl '^'iti^jt ~ %*)^ ^ ? ^ '^'jti'^it - Uitf + 4% + O + a^j^ 

< 8max|{ijf - Ujfpfmaxaji + max;^y^M^( ) +4% + o(l). 
Under the event ni^=i we have 



^ / log J) 

AOij + o(l) > minvar(uj(ii,t) - C\ — - 
«i V 1 



Smaxlujf — UjfP 

it 



1 . . 

> — mmvar(njtu,t). 



2CW + Ca-r + 2max(Tj: 



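Hence for all large $T, p$, uniformly in $i, j$, we have
$\hat{\theta}_{ij} \ge \frac{1}{45}\min_{i,j}\mathrm{var}(u_{it}u_{jt})$.
Finally, by Lemma A.3 and Assumption 2.2,

    $P\left(\bigcap_{i=1}^4 A_i\right) \ge 1 - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right),$

which completes the proof. □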
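A.2. Proof of Theorem 2.1. (i) For the operator norm, we have

    $\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| \le \max_{i \le p}\sum_{j=1}^p |\hat{\sigma}_{ij}I(|\hat{\sigma}_{ij}| \ge \omega_T\hat{\theta}_{ij}^{1/2}) - \sigma_{ij}|.$

By Lemma A.3(iii), there exists $C_1 > 0$ such that the event

    $A_1' = \left\{\max_{i,j \le p}|\hat{\sigma}_{ij} - \sigma_{ij}| \le C_1\left(\sqrt{\frac{\log p}{T}} + a_T\right)\right\}$

occurs with probability $P(A_1') \ge 1 - O(\frac{1}{p^2} + \kappa_1(p, T))$.
Let $C > 0$ be such that $C\sqrt{C_L} > 2C_1$, where $C_L$ is defined in
Lemma A.4. Let $\omega_T = C(\sqrt{\frac{\log p}{T}} + a_T)$,
$b_T = C_1(\sqrt{\frac{\log p}{T}} + a_T)$, then
$\sqrt{C_L}\omega_T > 2b_T$, and by Lemma A.4,

    $P\left(\min_{i,j \le p}\sqrt{\hat{\theta}_{ij}}\omega_T > 2b_T\right) \ge 1 - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right).$

Define the following events:

    $A_2' = \{\min_{i,j \le p}\sqrt{\hat{\theta}_{ij}}\omega_T > 2b_T\},$
    $A_3' = \{\max_{i,j \le p}\sqrt{\hat{\theta}_{ij}} \le C_U^{1/2}\},$

where $C_U$ is defined in Lemma A.4. Under $\bigcap_{i=1}^3 A_i'$, the event
$|\hat{\sigma}_{ij}| \ge \omega_T\hat{\theta}_{ij}^{1/2}$ implies
$|\sigma_{ij}| \ge b_T$, and the event
$|\hat{\sigma}_{ij}| < \omega_T\hat{\theta}_{ij}^{1/2}$ implies
$|\sigma_{ij}| < b_T + \sqrt{C_U}\omega_T$. We thus have, uniformly in
$i \le p$, under $\bigcap_{i=1}^3 A_i'$,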



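    $\sum_{j=1}^p |\hat{\sigma}_{ij}I(|\hat{\sigma}_{ij}| \ge \omega_T\hat{\theta}_{ij}^{1/2}) - \sigma_{ij}|$
    $\le \sum_{j=1}^p |\hat{\sigma}_{ij} - \sigma_{ij}|I(|\hat{\sigma}_{ij}| \ge \omega_T\hat{\theta}_{ij}^{1/2}) + \sum_{j=1}^p |\sigma_{ij}|I(|\hat{\sigma}_{ij}| < \omega_T\hat{\theta}_{ij}^{1/2})$
    $\le \sum_{j=1}^p b_TI(|\sigma_{ij}| \ge b_T) + \sum_{j=1}^p |\sigma_{ij}|I(|\sigma_{ij}| < b_T + \sqrt{C_U}\omega_T)$
    $\le (2b_T + \sqrt{C_U}\omega_T)m_T.$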






By Lemmas A.3(iii) and A.4,
$P(\bigcap_{i=1}^3 A_i') \ge 1 - O(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T))$,
which proves the result.

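(ii) By part (i) of the theorem, there exists some $C > 0$,

    $P(\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| > C\omega_Tm_T) = O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right).$

By Lemma A.1,

    $P(\lambda_{\min}(\hat{\Sigma}_u^{\mathcal{T}}) \ge 0.5\lambda_{\min}(\Sigma_u)) \ge P(\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| \le 0.5\lambda_{\min}(\Sigma_u))$
    $\ge 1 - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right).$

In addition, when $\omega_Tm_T = o(1)$,

    $P(\|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le 2\|\Sigma_u^{-1}\|^2C\omega_Tm_T)$
    $\ge P(\|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le 2\|\Sigma_u^{-1}\|^2\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|, \ \|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| \le C\omega_Tm_T)$
    $\ge P(\|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \le 2\|\Sigma_u^{-1}\|^2\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|) - P(\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| > C\omega_Tm_T)$
    $\ge P(\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| \le 0.5\lambda_{\min}(\Sigma_u)) - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right)$
    $\ge 1 - O\left(\frac{1}{p^2} + \kappa_1(p, T) + \kappa_2(p, T)\right),$

where the third inequality follows from Lemma A.1 as well.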

APPENDIX B: PROOFS FOR SECTION 3 
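B.1. Proof of Theorem 3.1.

Lemma B.1. There exists $C_1 > 0$ such that:

(i)
    $P\left(\max_{i,j \le K}\left|\frac{1}{T}\sum_{t=1}^T f_{it}f_{jt} - Ef_{it}f_{jt}\right| > C_1\sqrt{\frac{\log T}{T}}\right) = O\left(\frac{1}{T^2}\right);$

(ii)
    $P\left(\max_{k \le K, i \le p}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right| > C_1\sqrt{\frac{\log p}{T}}\right) = O\left(\frac{1}{p^2}\right).$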
Proof. (i) Let $Z_{ij} = \frac{1}{T}\sum_{t=1}^T (f_{it}f_{jt} - Ef_{it}f_{jt})$.
We bound $\max_{ij}|Z_{ij}|$ using a Bernstein-type inequality. Lemma A.2
implies that for any $i$ and $j \le K$, $f_{it}f_{jt}$ satisfies the
exponential tail condition (3.3) with parameter $r_3/3$. Let
$r_4^{-1} = 3r_3^{-1} + r_2^{-1}$, where $r_2 > 0$ is the parameter in the
strong mixing condition. By Assumption 3.3, $r_4 < 1$, and by the Bernstein
inequality for weakly dependent data in Merlevède, Peligrad and Rio [(2009),
Theorem 1], there exist $C_i > 0$, $i = 1, \ldots, 5$, for any $s > 0$,

    (B.1)   $\max_{i,j}P(|Z_{ij}| \ge s) \le T\exp\left(-\frac{(Ts)^{r_4}}{C_1}\right) + \exp\left(-\frac{T^2s^2}{C_2(1 + TC_3)}\right)$
    $\qquad\qquad + \exp\left(-\frac{(Ts)^2}{C_4T}\exp\left(\frac{(Ts)^{r_4(1-r_4)}}{C_5(\log Ts)^{r_4}}\right)\right).$

Using the Bonferroni inequality,

    $P\left(\max_{i \le K, j \le K}|Z_{ij}| > s\right) \le K^2\max_{i,j}P(|Z_{ij}| > s).$

Let s = C\J (logT)/r. For all large enough C, since = o{T), 

This proves part (i). 

(ii) By Lemma A.2, and Assumptions 2.1(iii) and 3.3(ii),
$Z_{ki,t} = f_{kt}u_{it}$ satisfies the exponential tail condition (2.6) for
the tail parameter $2r_1r_3/(3r_1 + 3r_3)$, as well as the strong mixing
condition with parameter $r_2$. Hence again we can apply the Bernstein
inequality for weakly dependent data in Merlevède, Peligrad and Rio [(2009),
Theorem 1] and the Bonferroni method on $Z_{ki,t}$ similar to (B.1) with the
parameter $\gamma_2^{-1} = 1.5r_1^{-1} + 1.5r_3^{-1} + r_2^{-1}$.
It follows from $3r_1^{-1} + r_2^{-1} > 1$ and $3r_3^{-1} + r_2^{-1} > 1$
that $\gamma_2 < 1$. Thus when $s = C\sqrt{(\log p)/T}$ for large enough $C$,
the term

    $pK\exp\left(-\frac{T^2s^2}{C_2(1 + TC_3)}\right) \le p^{-2},$

and the rest terms on the right-hand side of the inequality, multiplied by
$pK$, are of order $o(p^{-2})$. Hence when $(\log p)^{[\ldots]} = o(T)$ (which
is implied by the theorem's assumption) and $K = o(p)$, there exists
$C' > 0$,

    (B.2)   $P\left(\max_{k \le K, i \le p}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right| > C'\sqrt{\frac{\log p}{T}}\right) = O\left(\frac{1}{p^2}\right).$
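Proof of Lemma 3.1. (i) Since $K\sqrt{\log T} = o(\sqrt{T})$, and
$\lambda_{\min}(Ef_tf_t')$ is bounded away from zero, for large enough $T$, by
Lemma B.1(i),

    (B.3)   $P\left(\left\|\frac{1}{T}XX' - Ef_tf_t'\right\| \le 0.5\lambda_{\min}(Ef_tf_t')\right)$
    $\qquad \ge P\left(K\max_{i \le K, j \le K}\left|\frac{1}{T}\sum_{t=1}^T f_{it}f_{jt} - Ef_{it}f_{jt}\right| \le 0.5\lambda_{\min}(Ef_tf_t')\right) \ge 1 - O\left(\frac{1}{T^2}\right).$

Hence by Lemma A.1,

    (B.4)   $P(\lambda_{\min}(T^{-1}XX') \ge 0.5\lambda_{\min}(Ef_tf_t')) \ge 1 - O\left(\frac{1}{T^2}\right).$

As $\hat{b}_i - b_i = (XX')^{-1}Xu_i$, we have
$\|\hat{b}_i - b_i\|^2 = u_i'X'(XX')^{-2}Xu_i$. For $C' > 0$ such that (B.2)
holds, under the event

    $A = \left\{\max_{k \le K, i \le p}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right| \le C'\sqrt{\frac{\log p}{T}}\right\} \cap \{\lambda_{\min}(T^{-1}XX') \ge 0.5\lambda_{\min}(Ef_tf_t')\},$

we have

    $\|\hat{b}_i - b_i\|^2 \le \frac{4}{\lambda_{\min}(Ef_tf_t')^2}\sum_{k=1}^K\left(\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right)^2$
    $\le \frac{4K}{\lambda_{\min}(Ef_tf_t')^2}\max_{k \le K, i \le p}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right|^2$
    $\le \frac{4KC'^2\log p}{\lambda_{\min}(Ef_tf_t')^2T}.$

The desired result then follows from that $P(A) \ge 1 - O(T^{-2} + p^{-2})$.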



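(ii) For $C > \max_{k \le K}Ef_{kt}^2$, we have, by Lemma B.1(i),

    $P\left(\frac{1}{T}\sum_{t=1}^T \|f_t\|^2 > CK\right) \le P\left(K\max_{k \le K}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}^2 - Ef_{kt}^2\right| + K\max_{k \le K}Ef_{kt}^2 > CK\right) = O\left(\frac{1}{T^2}\right).$

The result then follows from

    $\max_{i \le p}\frac{1}{T}\sum_{t=1}^T |u_{it} - \hat{u}_{it}|^2 \le \max_{i \le p}\frac{1}{T}\sum_{t=1}^T \|f_t\|^2\|\hat{b}_i - b_i\|^2$

and part (i).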
(iii) By Assumption 3.3, for any $s > 0$,

    $P\left(\max_{t \le T}\|f_t\| > s\right) \le TP(\|f_t\| > s) \le TK\max_{k \le K}P(f_{kt}^2 > s^2/K)$

<TA'expf-f^=^ 
When s > C/^(logr)V'^3 f^v large enough C, that is, > 4bl\ 

The result then follows from

    $\max_{t \le T, i \le p}|u_{it} - \hat{u}_{it}| = \max_{t \le T, i \le p}|(\hat{b}_i - b_i)'f_t| \le \max_i\|\hat{b}_i - b_i\|\max_t\|f_t\|$

and Lemma 3.1(i). □

Proof of Theorem 3.1. Theorem 3.1 follows immediately from The- 
orem 2.1 and Lemma 3.1. □ 

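B.2. Proof of Theorem 3.2, part (i). Define

    $D_T = \widehat{\mathrm{cov}}(f_t) - \mathrm{cov}(f_t), \quad C_T = \hat{B} - B, \quad E = (u_1, \ldots, u_T).$

We have

    (B.5)   $\|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_\Sigma^2 \le 4\|BD_TB'\|_\Sigma^2 + 24\|B\widehat{\mathrm{cov}}(f)C_T'\|_\Sigma^2$
    $\qquad\qquad + 16\|C_T\widehat{\mathrm{cov}}(f)C_T'\|_\Sigma^2 + 2\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|_\Sigma^2.$

We bound the terms on the right-hand side in the following lemmas.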

Lemma B.2. There exists $C > 0$, such that:

(i)
    $P\left(\|D_T\|_F^2 > \frac{CK^2\log T}{T}\right) = O(T^{-2});$

(ii)
    $P\left(\|C_T\|_F^2 > \frac{CKp\log p}{T}\right) = O(T^{-2} + p^{-2}).$

Proof, (i) Similarly to the proof of Lemma B.l(i), it can be shown that 
there exists Ci > 0, 
Hence sup^ maxj</^ < oo implies that there exists C > such that



t=i t=i 



>c 



logT 



0{T- 



The result then follows from Lemma B.l(i) and that 



ID 



t\\f 



( 1 ^ 

\ t=i 

t=l t=l J 



+ max 



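(ii) We have $C_T = EX'(XX')^{-1}$. By Lemma B.1(ii), there exists $C' > 0$
such that

    $P\left(\max_{k,i}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right| > C'\sqrt{\frac{\log p}{T}}\right) = O(p^{-2}).$

Under the event

    $A = \left\{\max_{k,i}\left|\frac{1}{T}\sum_{t=1}^T f_{kt}u_{it}\right| \le C'\sqrt{\frac{\log p}{T}}\right\} \cap \{\lambda_{\min}(T^{-1}XX') \ge 0.5\lambda_{\min}(Ef_tf_t')\},$

$\|C_T\|_F^2 \le 4\lambda_{\min}^{-2}(Ef_tf_t')C'^2pK(\log p)/T$, which proves
the result since $\lambda_{\min}(Ef_tf_t')$ is bounded away from zero and
$P(A) \ge 1 - O(T^{-2} + p^{-2})$ due to (B.4). □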

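Lemma B.3. There exists $C > 0$ such that:

(i)
    $P\left(\|BD_TB'\|_\Sigma^2 + \|B\widehat{\mathrm{cov}}(f_t)C_T'\|_\Sigma^2 > \frac{CK\log p}{T} + \frac{CK^2\log T}{Tp}\right) = O(T^{-2} + p^{-2});$

(ii)
    $P\left(\|C_T\widehat{\mathrm{cov}}(f)C_T'\|_\Sigma^2 > \frac{CpK^2(\log p)^2}{T^2}\right) = O(T^{-2} + p^{-2}).$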



Proof. (i) The same argument in Fan, Fan and Lv [(2008), proof of
Theorem 2] implies that

    $\|B'\Sigma^{-1}B\| \le 2\|\mathrm{cov}(f_t)^{-1}\| = O(1).$

Hence

    (B.6)   $\|BD_TB'\|_\Sigma^2 = p^{-1}\mathrm{tr}(\Sigma^{-1/2}BD_TB'\Sigma^{-1}BD_TB'\Sigma^{-1/2})$
    $\qquad = p^{-1}\mathrm{tr}(D_TB'\Sigma^{-1}BD_TB'\Sigma^{-1}B)$
    $\qquad \le p^{-1}\|D_TB'\Sigma^{-1}B\|_F^2$
    $\qquad \le O(p^{-1})\|D_T\|_F^2.$

On the other hand,

    (B.7)   $\|B\widehat{\mathrm{cov}}(f)C_T'\|_\Sigma^2 \le 8T^{-2}\|BXX'C_T'\|_\Sigma^2 + 8T^{-4}\|BX11'X'C_T'\|_\Sigma^2.$

Respectively,

    (B.8)   $\|BXX'C_T'\|_\Sigma^2 \le p^{-1}\|XX'C_T'\Sigma^{-1}\|_F\|C_TXX'B'\Sigma^{-1}B\|_F,$
    $\qquad\quad \|BX11'X'C_T'\|_\Sigma^2 \le p^{-1}\|X11'X'C_T'\Sigma^{-1}\|_F\|C_TX11'X'B'\Sigma^{-1}B\|_F.$

By Lemma B.1(i), and $Ef_tf_t' < \infty$, $P(\|XX'\| > TC) = O(T^{-2})$ for
some $C > 0$. Hence, Lemma B.2(ii) implies

    (B.9)   $P(\|BXX'C_T'\|_\Sigma^2 > C'TK\log p) = O(T^{-2} + p^{-2})$

for some $C' > 0$. In addition, the eigenvalues of
$\widehat{\mathrm{cov}}(f_t) = T^{-1}XX' - T^{-2}X11'X'$ are all bounded away
from both zero and infinity with probability at least $1 - O(T^{-2})$
[implied by Lemmas B.1(i), A.1 and Assumption 3.4]. Hence for some $C_1 > 0$,
with probability at least $1 - O(T^{-2})$,

    (B.10)   $\|X11'X'\| \le \|TXX'\| \le T^2C_1,$
    $\qquad\quad \|BX11'X'C_T'\|_\Sigma^2 \le O(p^{-1})\|X11'X'\|^2\|C_T\|_F^2.$

The result then follows from the combination of (B.6)-(B.10) and Lemma B.2.

(ii) Straightforward calculation yields

    $p\|C_T\widehat{\mathrm{cov}}(f)C_T'\|_\Sigma^2 = \mathrm{tr}(C_T\widehat{\mathrm{cov}}(f)C_T'\Sigma^{-1}C_T\widehat{\mathrm{cov}}(f)C_T'\Sigma^{-1})$
    $\le \|C_T\widehat{\mathrm{cov}}(f)C_T'\Sigma^{-1}\|_F^2$
    $\le \lambda_{\max}^2(\Sigma^{-1})\lambda_{\max}^2(\widehat{\mathrm{cov}}(f_t))\|C_T\|_F^4.$

Since $\|\mathrm{cov}(f_t)\|$ is bounded, by Lemma B.1(i),
$\lambda_{\max}^2(\widehat{\mathrm{cov}}(f_t))$ is bounded with probability at
least $1 - O(T^{-2})$. The result again follows from Lemma B.2(ii). □

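Proof of Theorem 3.2, part (i). (a) We have

    (B.11)   $\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|_\Sigma^2 \le \|\Sigma^{-1/2}(\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u)\Sigma^{-1/2}\|^2$
    $\qquad\qquad \le \|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|^2 \cdot \lambda_{\max}^2(\Sigma^{-1}).$

Therefore, (B.5), (B.11), Theorem 3.1 and Lemmas B.2, B.3 yield the result,
with the fact that [assuming $\log T = o(p)$]

    $\frac{K\log p}{T} + \frac{K^2\log T}{Tp} + \frac{pK^2(\log p)^2}{T^2} + \frac{m_T^2K^2\log p}{T} = O\left(\frac{pK^2(\log p)^2}{T^2} + \frac{m_T^2K^2\log p}{T}\right).$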

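(b) For the infinity norm, it is straightforward to find that

    (B.12)   $\|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_{\max} \le \|2C_T\mathrm{cov}(f_t)B'\|_{\max} + \|BD_TB'\|_{\max}$
    $\qquad\qquad + \|C_T\mathrm{cov}(f_t)C_T'\|_{\max} + \|2BD_TC_T'\|_{\max}$
    $\qquad\qquad + \|C_TD_TC_T'\|_{\max} + \|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|_{\max}.$

By assumption, both $\|B\|_{\max}$ and $\|\mathrm{cov}(f_t)\|_{\max}$ are
bounded uniformly in $(p, K, T)$. In addition, let $e_i$ be a $p$-dimensional
column vector whose $i$th component is one with the remaining components being
zeros. Then under the events $\|D_T\|_{\max} \le C\sqrt{(\log T)/T}$,
$\max_{i \le K, j \le p}|\frac{1}{T}\sum_{t=1}^T f_{it}u_{jt}| \le C\sqrt{(\log p)/T}$,
$\|(\frac{1}{T}XX')^{-1}\| \le C$, and
$\max_{i \le p}\|\hat{b}_i - b_i\| \le C\sqrt{K(\log p)/T}$, we have, for some
$C' > 0$,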



1120^ cov(ff)B'||MAX < 2max ||e'jCTCOv(fj)B'e,-|| 

«J<p 

(B.13) < 2max ||bj — bj||||cov(fj)|| max||bj| 

i<p j<p 



<C'K 



|Ct||max = max 



\ogp 



T 



T 



-1 



(B.14) 



(B.15) 



< max 

i<p 



e-^^EX' 



< \fK max 

i<K^i<p 



T 



t=i 



-XX' 

T 



fitUji 



< C'^K{\ogp)/T, 



-XX' 



IBDtB'IImax < A'^||B||^^xI|Dt||max ^ C'K^ 



logT 



T 



|CTCOv(ft)C'j'||MAX < max ||e'jCT cov(ff)C^ej| 
(B.16) <max||e^CTf ||cov(fi)

i<p 

^ C'K^logp 



(B.17) 



and 



T 

2BDtC^||max < 2K^||B||max||Dt||max||Ct| 



MAX 



(B.18) IICrDrC^llMAX < -f^^||DT||MAx||CT||MAX = O ( -f^^ ■ ^ 



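Moreover, the $(i, j)$th entry of $\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u$ is
given by

    $\hat{\sigma}_{ij}I(|\hat{\sigma}_{ij}| \ge \omega_T\sqrt{\hat{\theta}_{ij}}) - \sigma_{ij} = -\sigma_{ij}$ if $|\hat{\sigma}_{ij}| < \omega_T\sqrt{\hat{\theta}_{ij}}$, and $\hat{\sigma}_{ij} - \sigma_{ij}$ otherwise.

Hence
$\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|_{\max} \le \max_{i,j \le p}|\hat{\sigma}_{ij} - \sigma_{ij}| + \omega_T\max_{i,j \le p}\sqrt{\hat{\theta}_{ij}}$,
which implies that with probability at least $1 - O(p^{-2} + T^{-2})$,

    (B.19)   $\|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\|_{\max} \le C'K\sqrt{\frac{\log p}{T}}.$

The result then follows from the combination of (B.12)-(B.19), (B.4) and
Lemmas 3.1, B.1. □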

B.3. Proof of Theorem 3.2, part (ii). We first prove two technical lemmas
to be used below.

Lemma B.4. (i) $\lambda_{\min}(B'\Sigma_u^{-1}B) \ge cp$ for some $c > 0$.
(ii) $\|[\mathrm{cov}(f)^{-1} + B'\Sigma_u^{-1}B]^{-1}\| = O(p^{-1})$.

Proof. (i) We have

    $\lambda_{\min}(B'\Sigma_u^{-1}B) \ge \lambda_{\min}(\Sigma_u^{-1})\lambda_{\min}(B'B).$

It then follows from Assumption 3.5 that $\lambda_{\min}(B'B) > cp$ for some
$c > 0$ and all large $p$. The result follows since $\|\Sigma_u\|$ is bounded
away from infinity.

(ii) It follows immediately from

    $\lambda_{\min}(\mathrm{cov}(f_t)^{-1} + B'\Sigma_u^{-1}B) \ge \lambda_{\min}(B'\Sigma_u^{-1}B).$ □
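Lemma B.5. There exists $C > 0$ such that:

(i)
    $P\left(\|\hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B} - B'\Sigma_u^{-1}B\| > Cpm_TK\sqrt{\frac{\log p}{T}}\right) = O\left(\frac{1}{p^2} + \frac{1}{T^2}\right);$

(ii)
    $P\left(\|[\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}\| > Cp^{-1}\right) = O\left(\frac{1}{p^2} + \frac{1}{T^2}\right);$

(iii) for $G = [\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}$,

    $P(\|\hat{B}G\hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\| > C) = O\left(\frac{1}{p^2} + \frac{1}{T^2}\right).$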

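Proof. (i) Let
$H = \|\hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B} - B'\Sigma_u^{-1}B\|$.

    $H \le 2\|C_T'\Sigma_u^{-1}B\| + 2\|C_T'((\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1})B\|$
    $\quad + \|B'((\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1})B\| + \|C_T'\Sigma_u^{-1}C_T\|$
    $\quad + \|C_T'((\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1})C_T\|.$

The same argument of Fan, Fan and Lv (2008) (equation 14) implies that
$\|B\|_F = O(\sqrt{p})$. Therefore, by Theorem 3.1 and Lemma B.2(ii), it is
straightforward to verify the result.

(ii) Since $\|D_T\|_F \ge \|D_T\|$, according to Lemma B.2(i), there exists
$C' > 0$ such that with probability at least $1 - O(T^{-2})$,
$\|D_T\| < C'K\sqrt{(\log T)/T}$. Thus by Lemma A.1, for some $C'' > 0$,

    $P(\|\widehat{\mathrm{cov}}(f_t)^{-1} - \mathrm{cov}(f_t)^{-1}\| < C''\|D_T\|) \ge P\left(\|D_T\| < C'K\sqrt{\frac{\log T}{T}}\right) \ge 1 - O(T^{-2}),$

which implies

    (B.20)   $P\left(\|\widehat{\mathrm{cov}}(f_t)^{-1} - \mathrm{cov}(f_t)^{-1}\| < C''C'K\sqrt{\frac{\log T}{T}}\right) \ge 1 - O(T^{-2}).$

Now let
$\hat{A} = \widehat{\mathrm{cov}}(f_t)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}$,
and $A = \mathrm{cov}(f_t)^{-1} + B'\Sigma_u^{-1}B$. Then part (i) and (B.20)
imply

    (B.21)   $P\left(\|\hat{A} - A\| \le C''C'K\sqrt{\frac{\log T}{T}} + Cpm_TK\sqrt{\frac{\log p}{T}}\right) \ge 1 - O\left(\frac{1}{p^2} + \frac{1}{T^2}\right).$

In addition, $m_TK\sqrt{(\log p)/T} = o(1)$. Hence by Lemmas A.1, B.4(ii), for
some $C > 0$,

    $P(\lambda_{\min}(\hat{A}) \ge Cp) \ge P(\|\hat{A} - A\| < Cp) \ge 1 - O\left(\frac{1}{p^2} + \frac{1}{T^2}\right),$

which implies the desired result.

(iii) By the triangular inequality, $\|\hat{B}\|_F \le \|C_T\|_F + O(\sqrt{p})$.
Hence Lemma B.2(ii) implies, for some $C > 0$,

    (B.22)   $P(\|\hat{B}\|_F \le C\sqrt{p}) \ge 1 - O(T^{-2} + p^{-2}).$

In addition, since $\|\Sigma_u^{-1}\|$ is bounded, it then follows from
Theorem 3.1 that $\|(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\|$ is bounded with
probability at least $1 - O(p^{-2} + T^{-2})$. The result then follows from
the fact that

    $P(\|G\| > Cp^{-1}) = O\left(\frac{1}{p^2} + \frac{1}{T^2}\right),$

which is shown in part (ii). □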

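To complete the proof of Theorem 3.2, part (ii), we follow similar lines of
proof as in Fan, Fan and Lv (2008). Using the Sherman-Morrison-Woodbury
formula, we have

    (B.23)   $\|(\hat{\Sigma}^{\mathcal{T}})^{-1} - \Sigma^{-1}\|$
    $\le \|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|$
    $\quad + \|((\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1})\hat{B}[\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}\hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\|$
    $\quad + \|((\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1})\hat{B}[\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}\hat{B}'\Sigma_u^{-1}\|$
    $\quad + \|\Sigma_u^{-1}(\hat{B} - B)[\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}\hat{B}'\Sigma_u^{-1}\|$
    $\quad + \|\Sigma_u^{-1}(\hat{B} - B)[\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}B'\Sigma_u^{-1}\|$
    $\quad + \|\Sigma_u^{-1}B([\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1} - [\mathrm{cov}(f)^{-1} + B'\Sigma_u^{-1}B]^{-1})B'\Sigma_u^{-1}\|$
    $= L_1 + L_2 + L_3 + L_4 + L_5 + L_6.$

The bound of $L_1$ is given in Theorem 3.1.
For $G = [\widehat{\mathrm{cov}}(f)^{-1} + \hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\hat{B}]^{-1}$, then

    (B.24)   $L_2 \le \|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| \cdot \|\hat{B}G\hat{B}'(\hat{\Sigma}_u^{\mathcal{T}})^{-1}\|.$

It follows from Theorem 3.1 and Lemma B.5(iii) that

    $P\left(L_2 \le Cm_TK\sqrt{\frac{\log p}{T}}\right) \ge 1 - O\left(\frac{1}{p^2} + \frac{1}{T^2}\right).$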
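The same bound can be achieved in the same way for $L_3$. For $L_4$, we have

    $L_4 \le \|\Sigma_u^{-1}\|^2 \cdot \|\hat{B} - B\| \cdot \|\hat{B}\| \cdot \|G\|.$

It follows from Lemmas B.2, B.5(ii), and inequality (B.22) that

    $P\left(L_4 \le C\sqrt{\frac{K\log p}{T}}\right) \ge 1 - O\left(\frac{1}{p^2} + \frac{1}{T^2}\right).$

The same bound also applies to $L_5$. Finally,

    $L_6 \le \|B\|_F^2\|\hat{A}^{-1} - A^{-1}\| \le \|B\|_F^2\|\hat{A} - A\| \cdot \|\hat{A}^{-1}\| \cdot \|A^{-1}\|,$

where both $\hat{A}$ and $A$ are defined after inequality (B.20). By
Lemma B.4(ii), $\|A^{-1}\| = O(p^{-1})$. Lemma B.5(ii) implies
$P(\|\hat{A}^{-1}\| > Cp^{-1}) = O(p^{-2} + T^{-2})$. Combining with (B.21),
we obtain

    $P\left(L_6 \le Cm_TK\sqrt{\frac{\log p}{T}}\right) \ge 1 - O\left(\frac{1}{p^2} + \frac{1}{T^2}\right).$

The proof is completed by combining $L_1$-$L_6$.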
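The Sherman-Morrison-Woodbury identity underlying (B.23) can be illustrated numerically. The sketch below assumes $\Sigma = B\,\mathrm{cov}(f)\,B' + \Sigma_u$, uses synthetic inputs, and the function name is hypothetical. It only inverts a $K \times K$ matrix beyond $\Sigma_u$ itself.

```python
import numpy as np

def factor_cov_inverse(B, cov_f, Sigma_u):
    """Inverse of Sigma = B cov_f B' + Sigma_u via Sherman-Morrison-Woodbury."""
    Su_inv = np.linalg.inv(Sigma_u)
    # G = [cov_f^{-1} + B' Sigma_u^{-1} B]^{-1} is only K x K.
    G = np.linalg.inv(np.linalg.inv(cov_f) + B.T @ Su_inv @ B)
    return Su_inv - Su_inv @ B @ G @ B.T @ Su_inv

rng = np.random.default_rng(5)
p, K = 40, 3
B = rng.normal(size=(p, K))
cov_f = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])
Sigma_u = np.diag(rng.uniform(0.5, 1.5, size=p))   # synthetic error covariance
Sigma = B @ cov_f @ B.T + Sigma_u
Sigma_inv = factor_cov_inverse(B, cov_f, Sigma_u)
```

The matrix `G` here is the same object as $G$ in the proof, which is why its operator norm of order $p^{-1}$ controls the terms $L_2$ through $L_6$.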



APPENDIX C: PROOFS FOR SECTION 4 

The proof is similar to that of Lemma 3.1. Thus we sketch it very briefly.
The OLS is given by

    $\hat{b}_i = (X_i'X_i)^{-1}X_i'y_i, \quad i \le p.$

The same arguments in the proof of Lemma B.l can yield, for large enough 
C>0, 



which then implies the rate of 

    $\max_{i \le p}\frac{1}{T}\sum_{t=1}^T (u_{it} - \hat{u}_{it})^2 \le \max_{i \le p}\|\hat{b}_i - b_i\|^2\frac{1}{T}\sum_{t=1}^T \|f_{it}\|^2.$

The result then follows from a straightforward application of Theorem 2.1. 